1 INTRODUCTION

nhanes <- read.csv("https://jmusa3.github.io/jmusa/nhanes.csv")

#CREATE MISSING VALUES in some variables

# Set seed for reproducibility
set.seed(123)  

# Determine the number of missing values (15% of total rows)
num_missing <- round(0.14 * nrow(nhanes))

# Randomly select row indices to introduce missing values 
missing_indices <- sample(1:nrow(nhanes), num_missing, replace = FALSE)

# Set selected rows in body.weight to NA
nhanes$body.weight[missing_indices] <- NA

# Verify missing values
#sum(is.na(nhanes$body.weight))  


# Set seed for reproducibility
set.seed(456)  

# Determine the number of missing values (15% of total rows)
num_missing <- round(0.15 * nrow(nhanes))

# Randomly select row indices to introduce missing values
missing_indices <- sample(1:nrow(nhanes), num_missing, replace = FALSE)

# Set selected rows in body.weight to NA
nhanes$cholestrol[missing_indices] <- NA


# Set seed for reproducibility
set.seed(789)  

# Determine the number of missing values (15% of total rows)
num_missing <- round(0.15 * nrow(nhanes))

# Randomly select row indices to introduce missing values
missing_indices <- sample(1:nrow(nhanes), num_missing, replace = FALSE)

# Set selected rows in race to NA
nhanes$race[missing_indices] <- NA

# Verify missing values

#sum(is.na(nhanes$race))  
#sum(is.na(nhanes$cholestrol)) 
#sum(is.na(nhanes$body.weight)) 

# create a copy of nhanes for MICE
nhanes1 <- nhanes
nhanes1$race <- as.factor(nhanes1$race)
nhanes1$cholestrol <- as.numeric(nhanes1$cholestrol)

The National Health and Nutrition Examination Survey (NHANES) is a program conducted by the National Center for Health Statistics (NCHS), a division of the Centers for Disease Control and Prevention (CDC). NHANES has been conducted in various forms since the early 1960s, with the current continuous survey format beginning in 1999. The survey is designed to assess the health and nutritional status of adults and children in the United States through a combination of interviews, physical examinations, and laboratory tests. Data is collected from a nationally representative sample of the U.S. population using a complex, multistage probability sampling design. Participants undergo household interviews followed by health examinations at mobile examination centers (MECs), where trained medical professionals collect biological samples and conduct clinical assessments. The NHANES data provides critical insights into public health trends, helping policymakers, researchers, and healthcare professionals make informed decisions on national health priorities. The purpose of this study is to assess the health and nutritional status of the study participants with metrics such as high blood pressure.

There are 15 features and 7927 observations in this data set. These features are listed below.

  1. obs: Respondent Identification Number
  2. psu: Pseudo-PSU 1,2 psu
  3. stratum: Pseudo-stratum (01 - 49)
  4. stat.weight: Statistical weight (225.93 - 139744.9)
  5. age: Age (years)
  6. sex: 0 = Female, 1 = Male sex
  7. race: Race (1 = White; 2 = Black; 3 = Other)
  8. body.weight: Body Weight (pounds)
  9. height: Standing Height (inches)
  10. avg.sbp: Average Systolic BP (mm/Hg)
  11. avg.dbp: Average Diastolic BP (mm/Hg)
  12. past.smk: Has respondent smoked > 100 cigarettes in life (1 = Yes, 2 = No)
  13. current.smk: Does respondent smoke cigarettes now? (1 = Yes, 2 = No)
  14. cholestrol: Serum Cholesterol (mg/100ml)
  15. hbp: High Blood Pressure (0 if PEPMNK1R <= 140; 1 if PEPMNK1R > 140)

The obs, psu, stratum, and stat.weight variables are not included in this analysis because they do not directly provide health-related information.

The data set has 317 missing values for body weight and 396 missing values for cholesterol.

2 EDA

2.1 Distribution of Individual Features

The distributions of the individual features in the NHANES dataset provide a snapshot of the characteristics of the population in terms of various health and demographic variables. Understanding these distributions is essential for identifying patterns, detecting potential biases, and assessing the range and variability of each feature. This analysis helps in understanding how variables such as age, body weight, blood pressure, and lifestyle factors are spread across the dataset, which is crucial for subsequent statistical modeling and interpretation.

# Convert variables to factors with proper labels
nhanes$sex <- factor(nhanes$sex, levels = c(0, 1), labels = c("Female", "Male"))
nhanes$race <- factor(nhanes$race, levels = c(1, 2, 3), labels = c("White", "Black", "Other"))
nhanes$past.smk <- factor(nhanes$past.smk, levels = c(1, 2), labels = c("Yes", "No"))
nhanes$current.smk <- factor(nhanes$current.smk, levels = c(1, 2), labels = c("Yes", "No"))

# Create frequency tables
sex_count <- table(nhanes$sex)

race_count <- table(nhanes$race)

past.smk_count <- table(nhanes$past.smk)

current.smk_count <- table(nhanes$current.smk)


par(mfrow = c(2, 2))

# Plot the bar charts
barplot(height = sex_count, col = "steelblue", main = "Sex", xlab = "Sex", ylab = "Count")

barplot(height = race_count, col = "steelblue", main = "Race", xlab = "Race", ylab = "Count")

barplot(height = past.smk_count, col = "steelblue", main = "Past Smoker", xlab = "Past Smoker", ylab = "Count")

barplot(height = current.smk_count, col = "steelblue", main = "Current Smoker", xlab = "Current Smoker", ylab = "Count")

There are slightly more male observations than female observation in this data set. The majority of participants are white, followed by black. All participants are recorded as being past smokers, which is likely an error. There is an imbalance for past smoker. In turn, past smoker cannot be used for analysis. About half of current participants are current smokers.

par(mfrow = c(2, 2))  

# Histogram for Cholesterol
hist(nhanes$cholestrol, 
     main = "Histogram of Cholesterol", 
     xlab = "Cholesterol", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)  # Control number of bins
    
     
# Histogram for Systolic BP
hist(nhanes$avg.sbp, 
     main = "Histogram of Systolic BP", 
     xlab = "Systolic BP", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)
    

# Histogram for Body Weight
hist(nhanes$body.weight, 
     main = "Histogram of Body Weight", 
     xlab = "Body Weight", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)
  
     
# Histogram for Height
hist(nhanes$body.weight, 
     main = "Histogram of Height", 
     xlab = "Height", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)   

Most participants do not have high blood pressure. The age of participants varies from about 20 to about 90. The height distribution is roughly symmetric. The body weight distribution is skewed right due to some high outliers. The high outliers may indicate obesity.

par(mfrow = c(2,2))

hist(nhanes$avg.sbp, col = "steelblue", 
     main = "Systolic BP",
     xlab = "Systolic BP", ylab = "Frequency",
     breaks = 30)

hist(nhanes$avg.dbp, col = "steelblue", 
     main = "Diastolic BP",
     xlab = "Diastolic BP", ylab = "Frequency",
     breaks = 30)

hist(nhanes$cholestrol, col = "steelblue", 
     main = "Cholesterol",
     xlab = "Cholesterol", ylab = "Frequency",
     breaks = 30)

Systolic BP and cholesterol have skewed right distributions. Diastolic BP is roughly symmetric. Of the 7927 participants, 1964 are obese.

2.2 Relationship between Features

Examining relationships between variables using scatterplots and a correlation matrix provides a comprehensive approach to understanding associations. Scatterplots visually reveal how two continuous variables interact, allowing us to identify trends, correlations, and potential outliers. Meanwhile, a correlation matrix quantifies the strength and direction of relationships between multiple variables at once, helping to pinpoint which pairs are strongly correlated. Together, scatterplots and a correlation matrix offer valuable insights into the underlying patterns in the data, guiding further statistical analysis and model selection.

# Set seed for reproducibility
set.seed(123)

# Sample 500 rows (adjust the number as needed)
sample_indices <- sample(nrow(nhanes), size = 500, replace = TRUE)
sampled_data <- nhanes[sample_indices, c("age", "body.weight", "height", "avg.sbp", "avg.dbp", "cholestrol")]

# Create a pairwise scatterplot matrix with the sampled data
pairs(sampled_data, 
      main = "Pairwise Scatterplots (Bootstrapped Sample)",
      pch = 19,           # Shape of the points (circle)
      col = rgb(0, 0, 1, 0.5)  # Color with transparency (blue)
)

nhanes_subset <- nhanes[, c("age", "body.weight", "height", "avg.sbp", "avg.dbp", "cholestrol")]

# Compute the correlation matrix
cor_matrix <- cor(nhanes_subset, use = "pairwise.complete.obs")

# Print the correlation matrix
print(cor_matrix)
                    age body.weight      height    avg.sbp   avg.dbp cholestrol
age          1.00000000 -0.01377966 -0.08538971 0.55641362 0.1143690  0.2738272
body.weight -0.01377966  1.00000000  0.45314709 0.13693751 0.2900425  0.1032280
height      -0.08538971  0.45314709  1.00000000 0.01379185 0.2060106 -0.0631896
avg.sbp      0.55641362  0.13693751  0.01379185 1.00000000 0.5213667  0.2306877
avg.dbp      0.11436905  0.29004249  0.20601064 0.52136671 1.0000000  0.1706567
cholestrol   0.27382718  0.10322800 -0.06318960 0.23068766 0.1706567  1.0000000
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.cex = 0.8)

The most highly correlate features are age and average systolic blood pressure, body weight and height, average systolic blood pressure and average diastolic blood pressure, age and cholesterol, body weight and cholesterol, and body weigh and average diastolic blood pressure. The pairwise relationships are displayed with individual scatter plots.

# Set seed for reproducibility
set.seed(123)

# Bootstrap a smaller sample (adjust size as needed)
sample_indices <- sample(nrow(nhanes), size = 500, replace = TRUE)
sampled_data <- nhanes[sample_indices, ]

# Set up a 2x3 plotting grid
par(mfrow = c(2, 3))

# Scatterplot for Age and Avg. Systolic Blood Pressure
plot(sampled_data$age, sampled_data$avg.sbp, main = "Age vs Avg. Systolic BP", 
     xlab = "Age", ylab = "Avg. Systolic BP", col = "blue", pch = 19)

# Scatterplot for Body Weight and Height
plot(sampled_data$body.weight, sampled_data$height, main = "Body Weight vs Height", 
     xlab = "Body Weight", ylab = "Height", col = "red", pch = 19)

# Scatterplot for Avg. Systolic Blood Pressure and Avg. Diastolic Blood Pressure
plot(sampled_data$avg.sbp, sampled_data$avg.dbp, main = "Avg. Systolic vs Avg. Diastolic BP", 
     xlab = "Avg. Systolic BP", ylab = "Avg. Diastolic BP", col = "forestgreen", pch = 19)

# Scatterplot for Age and Cholesterol
plot(sampled_data$age, sampled_data$cholestrol, main = "Age vs Cholesterol", 
     xlab = "Age", ylab = "Cholesterol", col = "purple", pch = 19)

# Scatterplot for Avg. Systolic BP and Cholesterol
plot(sampled_data$avg.sbp, sampled_data$cholestrol, main = "Avg Systolic BP vs Cholesterol", 
     xlab = "Cholesterol", ylab = "Avg.SBP", col = "gold", pch = 19)

# Scatterplot for Body Weight and Avg. Diastolic Blood Pressure
plot(sampled_data$body.weight, sampled_data$avg.dbp, main = "Body Weight vs Avg. Diastolic BP", 
     xlab = "Body Weight", ylab = "Avg. Diastolic BP", col = "maroon", pch = 19)

Six of the most correlated pairwise relationship are visualized in scatter plots. All pairwise relationships are linear with no apparent clustering.

# Set up a 2x2 plotting grid
par(mfrow = c(2, 2))

# Create a contingency table
sex_hbp_table <- table(nhanes$sex, nhanes$hbp)

# Generate mosaic plot
mosaicplot(sex_hbp_table, 
           col = c("gold", "royalblue"), 
           main = "Sex and High Blood Pressure",
           xlab = "Sex Status", 
           ylab = "HBP Status",
           border = "black")


# Create a contingency table
race_hbp_table <- table(nhanes$race, nhanes$hbp)

# Generate mosaic plot
mosaicplot(race_hbp_table, 
           col = c("gold", "royalblue"), 
           main = "Race and High Blood Pressure",
           xlab = "Race", 
           ylab = "HBP Status",
           border = "black")


# Create a contingency table
current.smk_hbp_table <- table(nhanes$current.smk, nhanes$hbp)

# Generate mosaic plot
mosaicplot(current.smk_hbp_table, 
           col = c("gold", "royalblue"), 
           main = "Current Smoking Status and HBP",
           xlab = "Current Smoking Status", 
           ylab = "HBP Status",
           border = "black")


# Create a contingency table
past.smk_hbp_table <- table(nhanes$past.smk, nhanes$hbp)

# Generate mosaic plot
mosaicplot(past.smk_hbp_table, 
           col = c("gold", "royalblue"), 
           main = "Past Smoking Status and HBP",
           xlab = "Past Smoking Status", 
           ylab = "HBP Status",
           border = "black")

Relationship between high blood pressure and race, sex, and current smoking status can be visualized in the mosaic plots. There appears to be an association between high current smoking and race with high blood pressure. A less pronounced relationship can been seen between sex and race with high blood pressure.

# Set up a 2x2 plotting grid
par(mfrow = c(1, 2))


# Create a contingency table
current.smk_race_table <- table(nhanes$race, nhanes$current.smk)

# Generate mosaic plot
mosaicplot(current.smk_race_table, 
           col = c("gold", "royalblue"), 
           main = "Race and Current Smoke Status",
           xlab = "Race", 
           ylab = "Current Smoke",
           border = "black")


# Create a contingency table
current.smk_sex_table <- table(nhanes$sex, nhanes$current.smk)

# Generate mosaic plot
mosaicplot(current.smk_sex_table, 
           col = c("gold", "royalblue"), 
           main = "Sex and Current Smoke Status",
           xlab = "Sex", 
           ylab = "Current Smoke",
           border = "black")

# Create a contingency table
past.smk_race_table <- table(nhanes$race, nhanes$past.smk)

An association exists between race and current smoking. Black individuals tend to smoke more than white or other individuals. An association exists between sex and current smoke status. Female tend to smoke currently slight more than males in this population.

3 MISSING VALUES

3.1 Imputation for Categorical Features

The distribution of missing values for is examined to determine any patterns. This step is done to ensure that value are missing at random.

# Set up a 2x2 plotting grid
par(mfrow = c(2, 2))

#sum(is.na(nhanes$race))

#mean(is.na(nhanes$race))

#table(nhanes$race, useNA = "ifany")


#ggplot(nhanes, aes(x = factor(race))) +
  #geom_bar(fill = "blue") +
  #labs(title = "Distribution of Race (Before Imputation)", x = "Race", y = "Count") +
  #theme_minimal()


aggr(nhanes, col = c("navyblue", "red"), numbers = TRUE, sortVars = TRUE, 
     labels = names(nhanes), cex.axis = 0.7, gap = 2, 
     ylab = c("Missing Data Overview", "Pattern of Missing Data"))


 Variables sorted by number of missings: 
    Variable     Count
        race 0.1499937
  cholestrol 0.1499937
 body.weight 0.1400278
           X 0.0000000
         obs 0.0000000
         psu 0.0000000
     stratum 0.0000000
 stat.weight 0.0000000
         age 0.0000000
         sex 0.0000000
      height 0.0000000
     avg.sbp 0.0000000
     avg.dbp 0.0000000
    past.smk 0.0000000
 current.smk 0.0000000
         hbp 0.0000000

A pattern of missing values exists between body weight, BMI, and obese. This pattern can be explained by the fact that body weight is used to calculate BMI, which is then used to classify observations for obese. However, the pattern of missing value appears to be random for the other features.

No pattern of missing values exists between race and other features, In turn, kNN imputation is used for missing values of race.

sum(is.na(nhanes$race))
[1] 1189
table(nhanes$race, useNA = "ifany")  # Check distribution including NAs

White Black Other  <NA> 
 4693  1858   187  1189 
# Perform KNN Imputation (single imputation)
nhanes_imputed <- kNN(nhanes, variable = "race", k = 3)

# Keep only the original columns (without extra columns added by kNN)
nhanes_imputed <- nhanes_imputed[, names(nhanes)]



# COMPARE before and after imputation distribution

par(mfrow = c(1,2))  # Split plotting area into two

# Convert race to a factor for better visualization
nhanes$race <- as.factor(nhanes$race)
nhanes_imputed$race <- as.factor(nhanes_imputed$race)

# Create a new column to indicate whether the data is original or imputed
nhanes$source <- "Before Imputation"
nhanes_imputed$source <- "After Imputation"

# Combine both datasets
nhanes_compare <- bind_rows(nhanes, nhanes_imputed)

# Plot the distribution before and after imputation
ggplot(nhanes_compare, aes(x = race, fill = source)) +
  geom_bar(position = "dodge") +
  labs(title = "Distribution of Race Before and After Imputation",
       x = "Race",
       y = "Count",
       fill = "Dataset") +
  theme_minimal()

imputed_nhanes <- nhanes_imputed

The overall distribution of race after imputation changed slightly after imputation.

3.2 Regression-Based Imputation for Numerical Feature

The pattern of missing values for body weight and cholesterol is examined for any potential patterns.

#vis_miss(nhanes[, c("body.weight", "cholestrol")])


# Set smaller font size for the plot
par(cex = 0.4)  # Adjust this value for desired font size

# Generate the missing data pattern plot

md.pattern(imputed_nhanes)

     X obs psu stratum stat.weight age sex race height avg.sbp avg.dbp past.smk
5792 1   1   1       1           1   1   1    1      1       1       1        1
1025 1   1   1       1           1   1   1    1      1       1       1        1
946  1   1   1       1           1   1   1    1      1       1       1        1
164  1   1   1       1           1   1   1    1      1       1       1        1
     0   0   0       0           0   0   0    0      0       0       0        0
     current.smk hbp source body.weight cholestrol     
5792           1   1      1           1          1    0
1025           1   1      1           1          0    1
946            1   1      1           0          1    1
164            1   1      1           0          0    2
               0   0      0        1110       1189 2299
# Reset graphics settings (optional)
par(cex = 1)

A pattern of missing values exists between body weight, BMI, and obese because

Body weight is correlated with height, average diastolic blood pressure, and sex. In turn, the missing body weight values will be imputed using a regression model using said explanatory features.

# Subset nhanes to contain only complete observations for selected variables
complete_nhanes <- imputed_nhanes[complete.cases(imputed_nhanes[, c("body.weight", "height", "avg.dbp", "sex")]), ]

# Fit the linear regression model
model_body.weight <- lm(body.weight ~ height + avg.dbp + sex, data = complete_nhanes)

# View the model summary
summary(model_body.weight)

Call:
lm(formula = body.weight ~ height + avg.dbp + sex, data = complete_nhanes)

Residuals:
    Min      1Q  Median      3Q     Max 
-86.131 -22.276  -4.259  17.073 269.863 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -207.90214    9.27828 -22.407  < 2e-16 ***
height         4.86582    0.14277  34.080  < 2e-16 ***
avg.dbp        0.76728    0.03913  19.606  < 2e-16 ***
sexMale       -7.00000    1.07917  -6.486  9.4e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33.52 on 6813 degrees of freedom
Multiple R-squared:  0.2495,    Adjusted R-squared:  0.2491 
F-statistic: 754.9 on 3 and 6813 DF,  p-value: < 2.2e-16
# Plot the residuals
plot(model_body.weight$residuals)

The residual plot shows no patterns. In turn, the model fit the data well and the assumption of constant variance of residuals is satisfied.

# Create a copy of the dataset to work with
#imputed_nhanes <- complete_nhanes

colSums(is.na(imputed_nhanes))
          X         obs         psu     stratum stat.weight         age 
          0           0           0           0           0           0 
        sex        race body.weight      height     avg.sbp     avg.dbp 
          0           0        1110           0           0           0 
   past.smk current.smk  cholestrol         hbp      source 
          0           0        1189           0           0 
# Identify the rows with missing values for body.weight
missing_body_weight <- is.na(imputed_nhanes$body.weight)

# Use the model to predict the missing values
imputed_nhanes$body.weight[missing_body_weight] <- predict(model_body.weight, 
                                                           newdata = imputed_nhanes[missing_body_weight, c("height", "avg.dbp", "sex", "age")])

# Check the dataset
#head(imputed_nhanes)

# Create a plot to overlay the density curves
ggplot() +
  geom_density(data = complete_nhanes, aes(x = body.weight), fill = "blue", alpha = 0.2, na.rm = TRUE) + 
  geom_density(data = imputed_nhanes, aes(x = body.weight), fill = "red", alpha = 0.2, na.rm = TRUE) +
  labs(title = "Density Plot of Body Weight: Before and After Imputation",
       x = "Body Weight", y = "Density") +
  theme_minimal() +
  scale_fill_manual(name = "Body Weight", values = c("blue" = "blue", "red" = "red"), 
                    labels = c("Before Imputation", "After Imputation"))

The distribution before and after imputation is similar. In turn, the imputation did not change the distribution of body weight.

The features that are most correlated to cholesterol are age, average systolic blood pressure, and body weight. In turn, the missing values of cholesterol will be imputed with a model using these features.

# Subset nhanes to include only rows with complete cases for cholestrol, age, avg.sbp, and BMI
complete_nhanes <- imputed_nhanes[!is.na(imputed_nhanes$cholestrol) & !is.na(imputed_nhanes$age) & 
                           !is.na(imputed_nhanes$avg.sbp) & !is.na(imputed_nhanes$body.weight), ]

complete_nhanes <- imputed_nhanes[complete.cases(imputed_nhanes[, c("cholestrol", "age", "avg.sbp", "body.weight")]), ]

# Fit the linear regression model to predict cholestrol
model_cholestrol <- lm(cholestrol ~ age + avg.sbp + body.weight, data = complete_nhanes)
summary(model_cholestrol)

Call:
lm(formula = cholestrol ~ age + avg.sbp + body.weight, data = complete_nhanes)

Residuals:
    Min      1Q  Median      3Q     Max 
-162.16  -28.18   -3.08   24.08  498.54 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 135.10667    3.96327  34.090  < 2e-16 ***
age           0.54613    0.03465  15.762  < 2e-16 ***
avg.sbp       0.21908    0.03267   6.707 2.15e-11 ***
body.weight   0.10341    0.01437   7.198 6.76e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42.59 on 6734 degrees of freedom
Multiple R-squared:  0.09077,   Adjusted R-squared:  0.09037 
F-statistic: 224.1 on 3 and 6734 DF,  p-value: < 2.2e-16
# View the model summary
summary(model_cholestrol)

Call:
lm(formula = cholestrol ~ age + avg.sbp + body.weight, data = complete_nhanes)

Residuals:
    Min      1Q  Median      3Q     Max 
-162.16  -28.18   -3.08   24.08  498.54 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 135.10667    3.96327  34.090  < 2e-16 ***
age           0.54613    0.03465  15.762  < 2e-16 ***
avg.sbp       0.21908    0.03267   6.707 2.15e-11 ***
body.weight   0.10341    0.01437   7.198 6.76e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42.59 on 6734 degrees of freedom
Multiple R-squared:  0.09077,   Adjusted R-squared:  0.09037 
F-statistic: 224.1 on 3 and 6734 DF,  p-value: < 2.2e-16
# Create a copy of nhanes for imputation
#imputed_nhanes <- nhanes

# Identify rows with missing cholestrol values
missing_cholestrol <- is.na(imputed_nhanes$cholestrol)

# Use the model to predict the missing cholestrol values
imputed_nhanes$cholestrol[missing_cholestrol] <- predict(model_cholestrol, 
                                                          newdata = imputed_nhanes[missing_cholestrol, c("age", "avg.sbp", "body.weight")])

All explanatory features are significant.

# Get residuals from the model for complete_nhanes
residuals_cholestrol <- residuals(model_cholestrol)

# Predicted values for complete_nhanes
predicted_cholestrol <- predict(model_cholestrol, newdata = complete_nhanes)

# Create a data frame for residuals and predicted values
residuals_data <- data.frame(
  predicted_cholestrol = predicted_cholestrol,
  residuals_cholestrol = residuals_cholestrol
)

# Create the residual plot
library(ggplot2)
ggplot(residuals_data, aes(x = predicted_cholestrol, y = residuals_cholestrol)) +
  geom_point(color = "blue") +  # Plot the residuals
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +  # Add a horizontal line at 0
  labs(title = "Residual Plot for Cholestrol Prediction",
       x = "Predicted Cholestrol",
       y = "Residuals") +
  theme_minimal()

# Create the overlay density plot
ggplot() +
  geom_density(data = nhanes, aes(x = cholestrol), fill = "blue", alpha = 0.2, na.rm = TRUE) +  # Before imputation
  geom_density(data = imputed_nhanes, aes(x = cholestrol), fill = "red", alpha = 0.2, na.rm = TRUE) +  # After imputation
  labs(title = "Density Plot of Cholestrol: Before and After Imputation",
       x = "Cholestrol", y = "Density") +
  theme_minimal() +
  scale_fill_manual(name = "Cholestrol", values = c("blue" = "blue", "red" = "red"), 
                    labels = c("Before Imputation", "After Imputation"))

The distribution of cholesterol did not change significantly after imputation. As well, the residual plot does not display any patterns. In turn, the model fit the data well and the assumption of constant variance of residuals is satisfied.

3.3 Feature Engineer

In turn a feature which calculates BMI is created to determine observations are classified as obese. According to the CDC (https://www.cdc.gov/nccdphp/dnpao/data-trends-maps/help/npao_dtm/definitions.html), Adult obesity is defined as body mass index (BMI) ≥ 30.0. Another feature named obese, based on BMI, is created, where 0 = no and 1 = yes.

# Create BMI feature
imputed_nhanes$BMI <- (imputed_nhanes$body.weight * 703) / (imputed_nhanes$height * imputed_nhanes$height)
# Round the BMI to the nearest ones place
imputed_nhanes$BMI <- round(imputed_nhanes$BMI, 0)

# Create the obese feature based on BMI
imputed_nhanes$obese <- ifelse(imputed_nhanes$BMI >= 30, 1, 0)
par(mfrow = c(1, 2))

# Histogram for BMI

hist(imputed_nhanes$BMI, 
     main = "Histogram of BMI", 
     xlab = "BMI", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)

# bar chart for obese
nhanes$obese <- factor(imputed_nhanes$obese, levels = c(0, 1), labels = c("No", "Yes"))

obese_count <- table(imputed_nhanes$obese)

barplot(height = obese_count, col = "steelblue", main = "Obesity", xlab = "Obesity", ylab = "Count")

# Set up a 2x2 plotting grid
par(mfrow = c(2, 2))


# Create a contingency table
obese_hbp_table <- table(imputed_nhanes$obese, imputed_nhanes$hbp)

# Generate mosaic plot
mosaicplot(obese_hbp_table, 
           col = c("gold", "royalblue"), 
           main = "Obesity and High Blood Pressure",
           xlab = "Obesity Status", 
           ylab = "HBP Status",
           border = "black")

# Create a contingency table
obese_sex_table <- table(imputed_nhanes$sex, imputed_nhanes$obese)

# Generate mosaic plot
mosaicplot(obese_sex_table, 
           col = c("gold", "royalblue"), 
           main = "Sex and Obesity",
           xlab = "Obesity Status", 
           ylab = "Sex",
           border = "black")


# Create a contingency table
obese_race_table <- table(imputed_nhanes$race, imputed_nhanes$obese)

# Generate mosaic plot
mosaicplot(obese_race_table, 
           col = c("gold", "royalblue"), 
           main = "Race and Obesity",
           xlab = "Obesity Status", 
           ylab = "Race",
           border = "black")


# Create a contingency table
obese_smoke_table <- table(imputed_nhanes$current.smk, imputed_nhanes$obese)

# Generate mosaic plot
mosaicplot(obese_smoke_table, 
           col = c("gold", "royalblue"), 
           main = "Current Smoking Status and Obesity",
           xlab = "Obesity Status", 
           ylab = "Current Smoking Status",
           border = "black")

3.4 Multiple Imputation

Multiple Imputation by Chained Equations (MICE) is a powerful technique used to handle missing data by creating multiple plausible imputations for each missing value based on relationships in the observed data. For body weight, cholesterol, and race, MICE can be applied to generate accurate imputations, accounting for correlations between these variables and other features in the dataset. This method allows for more reliable statistical analyses by reducing bias and preserving variability in the imputed values, making it especially useful for handling missing data in incoming new data sets.

nhanes1$race <- as.factor(nhanes1$race)  
nhanes1$cholestrol <- as.numeric(nhanes1$cholestrol)
nhanes1$body.weight <- as.numeric(nhanes1$body.weight)

nhanes2 <- nhanes1


# 1 initialization

init <- mice(nhanes2, maxit=0)
print(init$method)
          X         obs         psu     stratum stat.weight         age 
         ""          ""          ""          ""          ""          "" 
        sex        race body.weight      height     avg.sbp     avg.dbp 
         ""   "polyreg"       "pmm"          ""          ""          "" 
   past.smk current.smk  cholestrol         hbp 
         ""          ""       "pmm"          "" 
# 2 imputation models

imp <- mice(nhanes2, method = c("", "", "", "", "", "", "", "polyreg", "pmm", "", "", "", "", "", "pmm", ""), 
            maxit=10,
            m=1,
            seed = 123,
            print=F)

complete(imp, action = 1)[1:10,]
    X obs psu stratum stat.weight age sex race body.weight height avg.sbp
1   3   9   2      43    19451.83  48   0    1       149.7   61.8     131
2   5  11   2      40     1245.52  48   1    1       155.3   66.2     120
3   6  19   1      35     3860.97  44   1    2       189.6   70.2     133
4   7  34   1      13     5031.70  42   0    2       125.8   62.6     100
5  10  48   1      24    26919.29  56   0    1       239.9   67.6     128
6  12  51   1      28     2730.56  44   1    1       317.9   71.1     130
7  15  55   2      44      804.03  48   1    1       245.6   68.0     155
8  19  70   1       9     2629.39  63   1    2       202.9   67.8     137
9  20  71   1      43     1147.32  37   0    1       151.7   61.7     128
10 22  73   1      29     3991.81  42   1    3       205.0   68.4     148
   avg.dbp past.smk current.smk cholestrol hbp
1       73        1           2        236   0
2       70        1           2        260   0
3       85        1           1        187   0
4       67        1           1        216   0
5       73        1           2        156   0
6       86        1           1        162   0
7       91        1           1        212   1
8       68        1           1        186   0
9       70        1           1        209   0
10      83        1           1        267   1
# 3 multiple (iterative) imputations

imp5 <- mice(nhanes2, method = "pmm", m = 5, maxit = 10, seed = 123, print=F)
plot(imp5)

Multiple Imputation by Chained Equations (MICE) will be applied to handle missing values in the body weight variable. To assess the quality of the imputations, the Mean Squared Error (MSE) will be calculated and compared the MSE of the single imputation conducted for body weight earlier in this analysis, providing a measure of the imputation accuracy.

model5_bodyweight <- with(imp5, lm(body.weight ~ height + avg.dbp + sex))
summary.stats = summary(model5_bodyweight)

summary(pool(model5_bodyweight))
         term     estimate  std.error  statistic        df       p.value
1 (Intercept) -205.7045484 8.74935482 -23.510825 2473.3145 1.747163e-110
2      height    4.8466598 0.13536721  35.803796 1634.9416 8.934110e-208
3     avg.dbp    0.7523449 0.03941721  19.086712  157.8342  7.501512e-43
4         sex   -6.6078962 1.06330598  -6.214482  276.0479  1.892089e-09
beta = summary.stats$estimate[seq(1,15,by=3)]  # explicit vector: c(1,4,7,10,13)
beta.var = (summary.stats$std.error[seq(1,15,by=3)])^2 
Q = mean(beta)
U = mean(beta.var)
B = var(beta)
T = U + (6/5)*B
pool.se = sqrt(T)
cbind(pool.se.intercept = pool.se)
     pool.se.intercept
[1,]          123.2177

The model can be used to predict body weight for incoming new data.

The MSE for single body weight imputation used earlier in this analysis is MSE = (RSE)^2 = (33.52)^2 = 1123.6 which is substantially higher than MSE when using MICE. In turn, MICE is recommended in this analysis.

Multiple Imputation by Chained Equations (MICE) will be applied to handle missing values in the cholesterol variable. To assess the quality of the imputations, the Mean Squared Error (MSE) will be calculated and compared the MSE of the single imputation conducted for cholesterol earlier in this analysis, providing a measure of the imputation accuracy.

model5_chol <- with(imp5, lm(cholestrol ~ age + avg.sbp))
summary.stats = summary(model5_chol)


summary(pool(model5_chol))
         term    estimate  std.error statistic        df      p.value
1 (Intercept) 148.6912121 3.65965489 40.629845  74.24158 1.827860e-52
2         age   0.5233278 0.03407744 15.357017 236.74859 2.129193e-37
3     avg.sbp   0.2585339 0.03425598  7.547118  63.62530 2.094322e-10
beta = summary.stats$estimate[seq(1,15,by=3)]  # explicit vector: c(1,4,7,10,13)
beta.var = (summary.stats$std.error[seq(1,15,by=3)])^2 
Q = mean(beta)
U = mean(beta.var)
B = var(beta)
T = U + (6/5)*B
pool.se = sqrt(T)
cbind(pool.se.intercept = pool.se)
     pool.se.intercept
[1,]          3.659655

The model can be used to predict cholesterol for incoming new data.

The Mean Squared Error (MSE) from the Residual Standard Error (RSE) given in the model output from the cholesterol single imputation method used early in this analysis is MSE = (42.59)^2 = 1813.9. Thus, the Mean Squared Error (MSE) = 1813.9.

The MSE using MICE is 3.66 which is substantially less. Therefore, MICE is recommended in the analysis of this data set.

For the missing values in the race variable, Multiple Imputation by Chained Equations (MICE) will be applied using multinomial logistic regression. This method will account for the categorical nature of the race variable, allowing for appropriate imputation based on the relationships with other features in the dataset.

model_race <- with(imp5, multinom(race ~ avg.dbp + avg.sbp + cholestrol + hbp))
# weights:  18 (10 variable)
initial  value 8708.699612 
iter  10 value 5523.235718
final  value 5495.930272 
converged
# weights:  18 (10 variable)
initial  value 8708.699612 
iter  10 value 5555.595871
final  value 5522.606318 
converged
# weights:  18 (10 variable)
initial  value 8708.699612 
iter  10 value 5555.946819
final  value 5529.649391 
converged
# weights:  18 (10 variable)
initial  value 8708.699612 
iter  10 value 5560.173296
final  value 5529.439901 
converged
# weights:  18 (10 variable)
initial  value 8708.699612 
iter  10 value 5542.550592
final  value 5510.127472 
converged
# Pool the results from the imputed datasets
pooled_race <- pool(model_race)

# Get a summary of the pooled results
summary(pooled_race)
          term     estimate    std.error  statistic         df      p.value
1  (Intercept) -1.277932787 0.3325448882 -3.8428881   49.56081 3.468517e-04
2      avg.dbp  0.030728414 0.0030785469  9.9814670  343.89941 9.007490e-21
3      avg.sbp -0.009072042 0.0025775079 -3.5196952  252.21602 5.124609e-04
4   cholestrol -0.004119817 0.0007738102 -5.3240668   28.44201 1.091924e-05
5          hbp  0.148914062 0.1127433356  1.3208236  101.89054 1.895191e-01
6  (Intercept) -2.575346290 0.9149529588 -2.8147308   40.55536 7.498473e-03
7      avg.dbp  0.019877690 0.0080743228  2.4618399 3279.39165 1.387374e-02
8      avg.sbp -0.007708776 0.0071499482 -1.0781583  208.24247 2.822104e-01
9   cholestrol -0.005554712 0.0020814684 -2.6686508   36.04555 1.134283e-02
10         hbp -0.151093362 0.3197542727 -0.4725296  109.96948 6.374852e-01

The model can be used to predict race for incoming new data.

4 FEATURE TRANSFORMING

Based on the distribution patterns observed in the NHANES data set, several feature engineering tasks can be performed to enhance the data’s usability for modeling. First, for age, which is roughly uniform, no major transformation is needed, though we may consider binning it into age groups if it helps with interpretability. For weight and systolic blood pressure (BP), which are both skewed right, a log transformation could be applied to reduce skewness and make these variables more normally distributed, improving model performance. Similarly, for cholesterol, which is slightly skewed right, a log transformation may also be useful. For diastolic BP, which is roughly symmetric, no transformation is necessary, but checking for outliers could be helpful. These transformations aim to stabilize variance, reduce the impact of extreme values, and create features that are more suitable for modeling with techniques like linear regression or other statistical methods.

#Log transformation to reduce skew

imputed_nhanes$log_body.weight <- log(imputed_nhanes$body.weight)
imputed_nhanes$log_avg.sbp <- log(imputed_nhanes$avg.sbp)
imputed_nhanes$log_cholestrol <- log(imputed_nhanes$cholestrol)
imputed_nhanes$log_BMI <- log(imputed_nhanes$BMI)




# Set up the plotting area to show 2x2 grid of histograms
par(mfrow = c(2, 2))  # This will create a 2x2 grid of plots

# Histogram for Log-Cholesterol
hist(imputed_nhanes$log_cholestrol, 
     main = "Histogram of Log(Cholesterol)", 
     xlab = "Log(Cholesterol)", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)  
    
# Histogram for Log-Body Weight
hist(imputed_nhanes$log_body.weight, 
     main = "Histogram of Log(Body Weight)", 
     xlab = "Log(Body Weight)", 
     col = "steelblue", 
     border = "black", 
     breaks = 20) 


# Histogram for Log-Systolic BP
hist(imputed_nhanes$log_avg.sbp, 
     main = "Histogram of Log(Systolic BP)", 
     xlab = "Log(Systolic BP)", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)
   

# Histogram for Log-BMI
hist(imputed_nhanes$log_BMI, 
     main = "Histogram of Log(BMI)", 
     xlab = "Log(BMI)", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)

The distributions of these previously skewed features are more symmetric after log transformations.

5 STANDARDIZING

Standardizing features ensures that variables with different scales, such as age, BMI, height, and cholesterol, contribute equally to the analysis. Without standardization, variables with larger ranges could dominate model performance, leading to biased results. By standardizing these features, we improve model accuracy, enable better comparison between variables, and enhance the efficiency of machine learning algorithms, particularly in models that rely on distance-based or gradient-based methods.

# Specify the columns to be standardized
columns_to_standardize <- c("age", "BMI", "height", "body.weight", "cholestrol", "avg.dbp", "avg.sbp")

# Standardize the selected columns
imputed_nhanes[columns_to_standardize] <- scale(imputed_nhanes[columns_to_standardize])

6 FEATURE SELECTION

Feature selection is a critical step in model building, as it helps improve model performance by identifying the most relevant predictors and reducing overfitting. In this analysis, both model-based selection and regularization-based selection techniques will be used to assess and retain the most influential features for accurate prediction, ensuring a more efficient and interpretable model.

A full model is created to predict cholesterol.

#sapply(imputed_nhanes, function(x) length(unique(x)))

#head(imputed_nhanes, 25)


#head(imputed_nhanes)


model_full <- lm(cholestrol ~ age + race + BMI + height + body.weight + 
                  hbp + avg.dbp + avg.sbp + obese, 
                  data = imputed_nhanes)

# Summary of the full model
#summary(model_full)

# Tidy the model summary
model_summary <- tidy(model_full)

# Calculate Mean Squared Error (MSE)
mse <- mean(model_full$residuals^2)

# Calculate R-squared
r2 <- summary(model_full)$r.squared

# Create a data frame for MSE and R-squared
metrics_df <- data.frame(Statistic = c("Mean Squared Error", "R-squared"), 
                         Value = c(mse, r2))

# Print the model summary using kable
kable(model_summary, digits = 3, caption = "Regression Model Summary")
Regression Model Summary
term estimate std.error statistic p.value
(Intercept) 0.043 0.017 2.447 0.014
age 0.250 0.013 18.847 0.000
raceBlack -0.068 0.024 -2.781 0.005
raceOther -0.206 0.067 -3.058 0.002
BMI -0.122 0.082 -1.477 0.140
height -0.190 0.049 -3.862 0.000
body.weight 0.266 0.093 2.847 0.004
hbp -0.066 0.042 -1.574 0.116
avg.dbp 0.116 0.013 8.774 0.000
avg.sbp 0.054 0.021 2.597 0.009
obese -0.022 0.038 -0.589 0.556
# Print MSE and R-squared using kable
kable(metrics_df, digits = 3, caption = "Model Performance Metrics")
Model Performance Metrics
Statistic Value
Mean Squared Error 0.876
R-squared 0.124

Stepwise model selection is conducted to retain significant predictors. Selection processes based on AIC and BIC are conducted.

# Stepwise model selection based on AIC
model_stepwise_AIC <- step(model_full, direction = "both", trace = 0)

# Stepwise model selection based on BIC
model_stepwise_BIC <- step(model_full, direction = "both", k = log(nrow(imputed_nhanes)), trace = 0)


# Tidy the summaries for both models
stepwise_AIC_summary <- tidy(model_stepwise_AIC)
stepwise_BIC_summary <- tidy(model_stepwise_BIC)

# Compute MSE for both models
mse_AIC <- mean(model_stepwise_AIC$residuals^2)
mse_BIC <- mean(model_stepwise_BIC$residuals^2)

# Compute R-squared for both models
r2_AIC <- summary(model_stepwise_AIC)$r.squared
r2_BIC <- summary(model_stepwise_BIC)$r.squared

# Create data frames for MSE and R-squared
metrics_AIC_df <- data.frame(Statistic = c("Mean Squared Error", "R-squared"), 
                             Value = c(mse_AIC, r2_AIC))

metrics_BIC_df <- data.frame(Statistic = c("Mean Squared Error", "R-squared"), 
                             Value = c(mse_BIC, r2_BIC))

# Print the stepwise AIC model summary
kable(stepwise_AIC_summary, digits = 3, caption = "Stepwise AIC Regression Model Summary")
Stepwise AIC Regression Model Summary
term estimate std.error statistic p.value
(Intercept) 0.038 0.015 2.462 0.014
age 0.250 0.013 18.875 0.000
raceBlack -0.068 0.024 -2.792 0.005
raceOther -0.206 0.067 -3.062 0.002
BMI -0.127 0.082 -1.553 0.121
height -0.188 0.049 -3.842 0.000
body.weight 0.264 0.093 2.829 0.005
hbp -0.067 0.042 -1.588 0.112
avg.dbp 0.117 0.013 8.778 0.000
avg.sbp 0.054 0.021 2.604 0.009
# Print performance metrics for AIC model
kable(metrics_AIC_df, digits = 3, caption = "Model Performance Metrics (AIC)")
Model Performance Metrics (AIC)
Statistic Value
Mean Squared Error 0.876
R-squared 0.124
# Print the stepwise BIC model summary
kable(stepwise_BIC_summary, digits = 3, caption = "Stepwise BIC Regression Model Summary")
Stepwise BIC Regression Model Summary
term estimate std.error statistic p.value
(Intercept) 0.000 0.011 0.000 1
age 0.271 0.011 25.433 0
height -0.116 0.012 -9.640 0
body.weight 0.122 0.012 9.908 0
avg.dbp 0.129 0.011 11.519 0
# Print performance metrics for BIC model
kable(metrics_BIC_df, digits = 3, caption = "Model Performance Metrics (BIC)")
Model Performance Metrics (BIC)
Statistic Value
Mean Squared Error 0.879
R-squared 0.121
# Check R-squared value
summary(model_stepwise_AIC)$r.squared
[1] 0.123777
summary(model_stepwise_BIC)$r.squared
[1] 0.1210105

Both AIC and BIC stepwise model selection processes have similar R squared values. The BIC stepwise model is parsimonious. This model selects age, height, body weight, and average diastolic blood pressure as predictors for cholesterol.

Regularization-based selection techniques are useful for improving model performance and preventing overfitting, especially when you have many predictors. Two common regularization methods are Ridge regression and Lasso regression.

# Extract the response and predictors
y <- imputed_nhanes$cholestrol  # Response variable
X <- as.matrix(imputed_nhanes[, c("age", "BMI", "height", "body.weight", "avg.dbp", "avg.sbp")])

# Optional: Scale predictors (recommended for Ridge and Lasso)
X_scaled <- scale(X)

# Fit Ridge and Lasso models
ridge_model <- glmnet(X_scaled, y, alpha = 0)  # Ridge Regression (alpha = 0)
lasso_model <- glmnet(X_scaled, y, alpha = 1)  # Lasso Regression (alpha = 1)

# Ridge cross-validation
cv_ridge <- cv.glmnet(X_scaled, y, alpha = 0)
best_lambda_ridge <- cv_ridge$lambda.min

# Lasso cross-validation
cv_lasso <- cv.glmnet(X_scaled, y, alpha = 1)
best_lambda_lasso <- cv_lasso$lambda.min

# Extract coefficients for Ridge and Lasso at optimal lambda
ridge_coefficients <- as.data.frame(as.matrix(coef(cv_ridge, s = "lambda.min")))
colnames(ridge_coefficients) <- c("Coefficient")
ridge_coefficients$Predictor <- rownames(ridge_coefficients)

lasso_coefficients <- as.data.frame(as.matrix(coef(cv_lasso, s = "lambda.min")))
colnames(lasso_coefficients) <- c("Coefficient")
lasso_coefficients$Predictor <- rownames(lasso_coefficients)

# Predictions for Ridge
pred_ridge <- predict(cv_ridge, X_scaled, s = "lambda.min")

# Predictions for Lasso
pred_lasso <- predict(cv_lasso, X_scaled, s = "lambda.min")

# Calculate RMSE for Ridge
rmse_ridge <- sqrt(mean((pred_ridge - y)^2))

# Calculate RMSE for Lasso
rmse_lasso <- sqrt(mean((pred_lasso - y)^2))

# Calculate R² for Ridge
rss_ridge <- sum((pred_ridge - y)^2)  # Residual Sum of Squares
tss_ridge <- sum((y - mean(y))^2)    # Total Sum of Squares
r2_ridge <- 1 - rss_ridge / tss_ridge

# Calculate R² for Lasso
rss_lasso <- sum((pred_lasso - y)^2)
tss_lasso <- sum((y - mean(y))^2)
r2_lasso <- 1 - rss_lasso / tss_lasso

# Create a data frame for RMSE and R²
metrics_df <- data.frame(
  Model = c("Ridge", "Lasso"),
  RMSE = c(rmse_ridge, rmse_lasso),
  R2 = c(r2_ridge, r2_lasso)
)

# Print Ridge coefficients using kable
kable(ridge_coefficients, digits = 3, caption = "Ridge Regression Coefficients")
Ridge Regression Coefficients
Coefficient Predictor
(Intercept) 0.000 (Intercept)
age 0.245 age
BMI 0.041 BMI
height -0.086 height
body.weight 0.071 body.weight
avg.dbp 0.109 avg.dbp
avg.sbp 0.039 avg.sbp
# Print Lasso coefficients using kable
kable(lasso_coefficients, digits = 3, caption = "Lasso Regression Coefficients")
Lasso Regression Coefficients
Coefficient Predictor
(Intercept) 0.000 (Intercept)
age 0.255 age
BMI 0.000 BMI
height -0.113 height
body.weight 0.120 body.weight
avg.dbp 0.114 avg.dbp
avg.sbp 0.030 avg.sbp
# Print RMSE and R² values using kable
kable(metrics_df, digits = 3, caption = "Model Performance Metrics (RMSE & R²)")
Model Performance Metrics (RMSE & R²)
Model RMSE R2
Ridge 0.937 0.121
Lasso 0.937 0.121

The Ridge and Lasso method both had similar R squared values. Therefore, both models are recommended.

7 LINEAR REGRESSION

In this regression analysis of the NHANES dataset, three models are developed using cross validation to predict the outcome variable cholesterol: a full model including all selected predictors, a stepwise BIC model for variable selection, and a Ridge regression model to address multicollinearity. Each model was evaluated based on its predictive performance, with Mean Squared Error (MSE) and R-squared (R^2) used as key comparison metrics. By analyzing these models, the aim is to determine the most accurate and efficient approach for modeling the data while balancing complexity and predictive power.

set.seed(42)
train_control <- trainControl(method = "cv", number = 10)

7.1 Full Model

The full model includes all selected predictor variables, providing a comprehensive approach to estimating cholesterol levels. To assess its stability, we implement k-fold cross-validation, which systematically partitions the dataset into multiple training and validation subsets. This approach mitigates the risk of overfitting by testing the model’s performance on unseen data, yielding a robust estimate of prediction error. The model’s effectiveness is quantified using the Mean Squared Error (MSE), which measures the average squared difference between observed and predicted cholesterol values. By employing cross-validation, we ensure that the full model is evaluated rigorously, providing insights into its accuracy and reliability in real-world applications.

# Set up 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)  # 10-fold CV

# Fit the full model
full_model <- train(cholestrol ~ age + race + log_BMI + height + log_body.weight + 
                    hbp + avg.dbp + log_avg.sbp + obese, 
                    data = imputed_nhanes, 
                    method = "lm", 
                    trControl = train_control)

# Extract coefficients from the final model
coeff_table <- summary(full_model$finalModel)$coefficients %>%
  as.data.frame() %>%
  rownames_to_column(var = "Predictor")  # Convert row names to a column

# Calculate Cross-Validation MSE
mse <- mean(full_model$resample$RMSE^2)

# Calculate R² from the final model
r2 <- summary(full_model$finalModel)$r.squared

# Create a data frame for MSE and R²
metrics_df <- data.frame(
  Statistic = c("Mean Squared Error", "R-squared"),
  Value = c(mse, r2)
)

# Print the coefficients using kable
kable(coeff_table, digits = 4, caption = "Regression Model Coefficients", format = "markdown")
Regression Model Coefficients
Predictor Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.5758 1.7615 -2.5977 0.0094
age 0.2434 0.0134 18.1921 0.0000
raceBlack -0.0679 0.0243 -2.7969 0.0052
raceOther -0.2054 0.0671 -3.0602 0.0022
log_BMI 0.6340 0.8695 0.7292 0.4659
height -0.0660 0.0975 -0.6772 0.4983
log_body.weight 0.1011 0.8701 0.1162 0.9075
hbp -0.0668 0.0403 -1.6602 0.0969
avg.dbp 0.1091 0.0134 8.1549 0.0000
log_avg.sbp 0.4214 0.1375 3.0636 0.0022
obese -0.0581 0.0367 -1.5830 0.1135
# Print MSE and R² using kable
kable(metrics_df, digits = 4, caption = "Model Performance Metrics (MSE & R²)", format = "markdown")
Model Performance Metrics (MSE & R²)
Statistic Value
Mean Squared Error 0.8762
R-squared 0.1266

7.2 Stepwise BIC

In this analysis, a Stepwise BIC regression model is implemented to identify the most parsimonious subset of predictors for estimating cholesterol levels. The Stepwise BIC approach selects variables by minimizing the Bayesian Information Criterion (BIC), which penalizes model complexity to prevent overfitting. To assess the model’s generalizability, we employ cross-validation by splitting the dataset into training and testing subsets. The model is trained on the training data, and predictions are made on the test data to compute the Mean Squared Error (MSE) as a measure of model performance. Additionally, the estimated regression coefficients in a structured table for clarity and interpretation. This approach ensures an optimal balance between model complexity and predictive accuracy while validating the model’s robustness.

stepwise_bic_model <- function(train_data, test_data) {
  # Fit initial full model
  full_model2 <- lm(cholestrol ~ age + height + log_body.weight + avg.dbp, data = train_data)
  
  # Perform stepwise selection using BIC
  model_stepwise_BIC <- stepAIC(full_model2, direction = "both", k = log(nrow(train_data)), trace = FALSE)
  
  # Extract coefficients
  coeff_table <- summary(model_stepwise_BIC)$coefficients %>%
    as.data.frame() %>%
    rownames_to_column(var = "Predictor")  # Convert row names to a column

  # Print coefficients using kable
  print(kable(coeff_table, digits = 4, caption = "Stepwise BIC Model Coefficients", format = "markdown"))

  # Predict on test data
  predictions <- predict(model_stepwise_BIC, newdata = test_data)
  
  # Compute Mean Squared Error (MSE)
  mse <- mean((test_data$cholestrol - predictions)^2)

  # Compute R²
  rss <- sum((test_data$cholestrol - predictions)^2)  # Residual Sum of Squares
  tss <- sum((test_data$cholestrol - mean(test_data$cholestrol))^2)  # Total Sum of Squares
  r2 <- 1 - rss/tss  # R-squared
  
  # Create a data frame for MSE and R²
  metrics_df <- data.frame(
    Statistic = c("Mean Squared Error", "R-squared"),
    Value = c(mse, r2)
  )

  # Print MSE and R² using kable
  print(kable(metrics_df, digits = 4, caption = "Stepwise BIC Model Performance Metrics (MSE & R²)", format = "markdown"))

  return(list(model = model_stepwise_BIC, mse = mse, r2 = r2))  # Return the model, MSE, and R²
}

# Example usage (Assuming you have a dataset split into train_data and test_data)
set.seed(123)
train_index <- createDataPartition(imputed_nhanes$cholestrol, p = 0.8, list = FALSE)
train_data <- imputed_nhanes[train_index, ]
test_data <- imputed_nhanes[-train_index, ]

# Run the function
stepwise_results <- stepwise_bic_model(train_data, test_data)


Table: Stepwise BIC Model Coefficients

|Predictor       | Estimate| Std. Error| t value| Pr(>&#124;t&#124;)|
|:---------------|--------:|----------:|-------:|------------------:|
|(Intercept)     |  -3.3622|     0.3416| -9.8427|                  0|
|age             |   0.2671|     0.0120| 22.2994|                  0|
|height          |  -0.1274|     0.0138| -9.2161|                  0|
|log_body.weight |   0.6589|     0.0669|  9.8520|                  0|
|avg.dbp         |   0.1191|     0.0125|  9.5632|                  0|


Table: Stepwise BIC Model Performance Metrics (MSE & R²)

|Statistic          |  Value|
|:------------------|------:|
|Mean Squared Error | 0.8383|
|R-squared          | 0.1425|

7.3 Ridge Regression

Ridge regression is applied to the NHANES dataset to address multicollinearity among predictor variables while maintaining model stability. By imposing a penalty on large coefficients through L2 regularization, Ridge regression prevents overfitting and improves generalization. The model was evaluated using cross-validation, and its performance was compared to other regression approaches based on Mean Squared Error (MSE) and R².

# Define predictor matrix (X) and response variable (y)
X <- as.matrix(imputed_nhanes[, c("age", "log_BMI", "height", "log_body.weight", "avg.dbp", "log_avg.sbp")])
y <- imputed_nhanes$cholestrol

set.seed(123)  # For reproducibility

# Define training control with 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)

# Train Ridge Regression Model
ridge_model <- train(
  x = X, y = y,
  method = "glmnet",
  trControl = train_control,
  tuneGrid = expand.grid(alpha = 0, lambda = 10^seq(-3, 3, length = 100))  # Ridge: alpha = 0
)

# Best lambda value selected via cross-validation
best_lambda <- ridge_model$bestTune$lambda

# Predict using best lambda
ridge_predictions <- predict(ridge_model, X)

# Compute Mean Squared Error (MSE)
ridge_mse <- mean((y - ridge_predictions)^2)

# Compute R²
rss <- sum((y - ridge_predictions)^2)  # Residual Sum of Squares
tss <- sum((y - mean(y))^2)  # Total Sum of Squares
ridge_r2 <- 1 - (rss / tss)  # R² calculation

# Extract coefficients from the final Ridge model
coefficients <- as.data.frame(as.matrix(ridge_model$finalModel$beta)) %>%
  tibble::rownames_to_column(var = "Predictor")  # Convert row names to a column

#print(kable(head(coefficients, 10), digits = 4, caption = "Top 10 Ridge Regression Coefficients", format = "markdown"))


# Create a data frame for MSE, R², and Best Lambda
metrics_df <- data.frame(
  Statistic = c("Mean Squared Error", "R-squared", "Optimal Lambda"),
  Value = c(ridge_mse, ridge_r2, best_lambda)
)

# Print model performance metrics using kable()
print(kable(metrics_df, digits = 4, caption = "Ridge Regression Performance Metrics", format = "markdown"))


Table: Ridge Regression Performance Metrics

|Statistic          |  Value|
|:------------------|------:|
|Mean Squared Error | 0.8757|
|R-squared          | 0.1242|
|Optimal Lambda     | 0.0285|

7.4 Comparison

After performing cross-validation on the NHANES_imputed dataset, the predictive performance of the thre models are compared. The BIC model has the smallest MSE and highest R^2 value, indicating a better performing regression model. In, turn, the BIC model is recommended.

8 LOGISTIC REGRESSION

This analysis applies logistic regression to predict current smoking status using data from NHANES. Four models are developed: a full model including all relevant predictors, a stepwise AIC model for variable selection, a BIC model for a more parsimonious approach, and a regularized logistic regression model using Lasso to prevent overfitting. The models are evaluated and compared using Receiver Operating Characteristic (ROC) curves and the Area Under the Curve (AUC) to assess their classification performance and predictive accuracy.

8.1 Full Model

The first candidate, a full model, is fitted to predict current smoking status.

# Fit logistic regression model
full_model <- glm(current.smk ~ age + race + log_body.weight + log_BMI + obese + 
                    height + log_avg.sbp + avg.dbp + log_cholestrol + hbp, 
                  data = imputed_nhanes, 
                  family = binomial)

# Extract coefficients from the model
coefficients <- summary(full_model)$coefficients

# Convert coefficients to a data frame and add row names as a column
coeff_table <- as.data.frame(coefficients) %>%
  tibble::rownames_to_column(var = "Predictor")  # Convert row names to a column

# Print the coefficients table using kable
kable(coeff_table, digits = 4, caption = "Logistic Regression Model Coefficients", format = "markdown")
Logistic Regression Model Coefficients
Predictor Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.6100 4.2576 -2.2571 0.0240
age 0.8518 0.0341 24.9430 0.0000
raceBlack -0.8473 0.0583 -14.5294 0.0000
raceOther -0.3106 0.1567 -1.9825 0.0474
log_body.weight 2.6402 2.0813 1.2685 0.2046
log_BMI -0.6406 2.0792 -0.3081 0.7580
obese -0.1208 0.0867 -1.3930 0.1636
height -0.2097 0.2330 -0.8999 0.3682
log_avg.sbp -0.4720 0.3284 -1.4375 0.1506
avg.dbp 0.0655 0.0324 2.0189 0.0435
log_cholestrol 0.1397 0.1348 1.0365 0.3000
hbp -0.0763 0.0957 -0.7975 0.4252

Age, race, and average dbp are significant features in the full model.

8.2 Stepwise AIC Model

The second model, stepwise AIC, is fitted to predict current smoking status. This model automatically selects a subset of predictors using AIC-based stepwise selection.

# Fit initial full model
full_model <- glm(current.smk ~ age + race + log_body.weight + log_BMI + obese + 
                    height + log_avg.sbp + avg.dbp + log_cholestrol + hbp, 
                  data = imputed_nhanes, 
                  family = binomial)

# Perform stepwise selection using AIC
stepwise_model <- stepAIC(full_model, direction = "both", trace = FALSE)

# Extract coefficients from the stepwise model
coefficients_stepwise <- summary(stepwise_model)$coefficients

# Convert coefficients to a data frame and add row names as a column
coeff_table_stepwise <- as.data.frame(coefficients_stepwise) %>%
  tibble::rownames_to_column(var = "Predictor")  # Convert row names to a column

# Print the coefficients table using kable
kable(coeff_table_stepwise, digits = 4, caption = "Stepwise Logistic Regression Model Coefficients", format = "markdown")
Stepwise Logistic Regression Model Coefficients
Predictor Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.1104 1.5749 -4.5148 0.0000
age 0.8583 0.0335 25.6466 0.0000
raceBlack -0.8515 0.0582 -14.6232 0.0000
raceOther -0.3164 0.1565 -2.0212 0.0433
log_body.weight 2.0364 0.2072 9.8304 0.0000
obese -0.1260 0.0865 -1.4575 0.1450
height -0.1417 0.0341 -4.1583 0.0000
log_avg.sbp -0.6344 0.2525 -2.5125 0.0120
avg.dbp 0.0698 0.0323 2.1632 0.0305

All features, other than obese, are significant in the stepwise AIC model.

8.3 BIC Model

The third candidate,a BIC model, is fitted to predict current smoking status. This model fits a more parsimonious model.

# Fit initial full model
full_model <- glm(current.smk ~ age + race + log_body.weight + log_BMI + obese + 
                    height + log_avg.sbp + avg.dbp + log_cholestrol + hbp, 
                  data = imputed_nhanes, 
                  family = binomial)

# Perform stepwise selection using BIC (using k = log(nrow(imputed_nhanes)))
bic_model <- stepAIC(full_model, direction = "both", k = log(nrow(imputed_nhanes)), trace = FALSE)

# Extract coefficients from the BIC model
coefficients_bic <- summary(bic_model)$coefficients

# Convert coefficients to a data frame and add row names as a column
coeff_table_bic <- as.data.frame(coefficients_bic) %>%
  tibble::rownames_to_column(var = "Predictor")  # Convert row names to a column

# Print the coefficients table using kable
kable(coeff_table_bic, digits = 4, caption = "BIC Stepwise Model Coefficients", format = "markdown")
BIC Stepwise Model Coefficients
Predictor Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.1460 0.7115 -12.8541 0.0000
age 0.8140 0.0268 30.3780 0.0000
raceBlack -0.8552 0.0579 -14.7775 0.0000
raceOther -0.3123 0.1562 -1.9988 0.0456
log_body.weight 1.8297 0.1394 13.1290 0.0000
height -0.1092 0.0290 -3.7669 0.0002

All features are significant in the BIC model.

8.4 Regularized Logistic Regression Model (Lasso)

The fourth candidate model uses Regularized Logistic Regression (Lasso). This method selects the most important predictors.

# Define predictor matrix (X) and response variable (y)
x <- model.matrix(current.smk ~ age + race + log_body.weight + log_BMI + obese + height + log_avg.sbp + avg.dbp + log_cholestrol + hbp, 
                  data = imputed_nhanes)[, -1]  # Exclude intercept
y <- imputed_nhanes$current.smk

# Train the Lasso model using cross-validation
lasso_model <- cv.glmnet(x, y, family = "binomial", alpha = 1)


# Best lambda value selected via cross-validation
best_lambda_lasso <- lasso_model$lambda.min
print(paste("Optimal Lambda for Lasso:", best_lambda_lasso))
[1] "Optimal Lambda for Lasso: 0.000783042982863974"
# Extract coefficients from the final Lasso model at the optimal lambda
lasso_coefficients <- coef(lasso_model, s = "lambda.min")

# Convert coefficients to a data frame
lasso_coefficients_df <- as.data.frame(as.matrix(lasso_coefficients))
lasso_coefficients_df$Predictor <- rownames(lasso_coefficients_df)

# Reorder columns to make predictor names the first column
lasso_coefficients_df <- lasso_coefficients_df %>%
  select(Predictor, everything())

# Print the coefficients table using kable
kable(lasso_coefficients_df, digits = 4, caption = "Lasso Regression Coefficients", format = "markdown")
Lasso Regression Coefficients
Predictor s1
(Intercept) (Intercept) -6.4108
age age 0.8393
raceBlack raceBlack -0.8379
raceOther raceOther -0.2843
log_body.weight log_body.weight 0.7797
log_BMI log_BMI 1.1279
obese obese -0.0866
height height 0.0000
log_avg.sbp log_avg.sbp -0.3598
avg.dbp avg.dbp 0.0546
log_cholestrol log_cholestrol 0.1316
hbp hbp -0.0746

8.5 Comparison

In this analysis, four logistic regression models—Full Model, Stepwise AIC Model, BIC Model, and Lasso Model are compared using ROC (Receiver Operating Characteristic) curves and AUC (Area Under the Curve) to assess their predictive performance. The optimal cutoff for each model using Youden’s Index is determined, which maximizes the balance between sensitivity and specificity. By evaluating these models through both graphical and numerical metrics, insights is gained into which model provides the best classification performance for predicting the outcome variable.

# Full Model
full_model <- glm(current.smk ~ age + race + log_body.weight + log_BMI + obese + height + log_avg.sbp + avg.dbp + log_cholestrol + hbp, data = imputed_nhanes, family = binomial)

# Stepwise AIC Model
stepwise_model <- stepAIC(full_model, direction = "both", trace = FALSE)

# BIC Model
bic_model <- stepAIC(full_model, direction = "both", k = log(nrow(imputed_nhanes)), trace = FALSE)

# Lasso Model
x <- model.matrix(current.smk ~ age + race + log_body.weight + log_BMI + obese + height + log_avg.sbp + avg.dbp + log_cholestrol + hbp , data = imputed_nhanes)[, -1]  # Exclude intercept
y <- imputed_nhanes$current.smk
lasso_cv <- cv.glmnet(x, y, family = "binomial", alpha = 1)
lasso_model <- glmnet(x, y, family = "binomial", alpha = 1, lambda = lasso_cv$lambda.min)

# Get predicted probabilities
imputed_nhanes$full_pred <- predict(full_model, type = "response")
imputed_nhanes$stepwise_pred <- predict(stepwise_model, type = "response")
imputed_nhanes$bic_pred <- predict(bic_model, type = "response")
imputed_nhanes$lasso_pred <- predict(lasso_model, newx = x, type = "response")[,1]

# Create ROC curves
full_roc <- roc(imputed_nhanes$current.smk, imputed_nhanes$full_pred)
stepwise_roc <- roc(imputed_nhanes$current.smk, imputed_nhanes$stepwise_pred)
bic_roc <- roc(imputed_nhanes$current.smk, imputed_nhanes$bic_pred)
lasso_roc <- roc(imputed_nhanes$current.smk, imputed_nhanes$lasso_pred)

# Calculate optimal cutoff using Youden's Index
optimal_cutoff_full <- full_roc$specificities[which.max(full_roc$sensitivities + full_roc$specificities - 1)]
optimal_cutoff_stepwise <- stepwise_roc$specificities[which.max(stepwise_roc$sensitivities + stepwise_roc$specificities - 1)]
optimal_cutoff_bic <- bic_roc$specificities[which.max(bic_roc$sensitivities + bic_roc$specificities - 1)]
optimal_cutoff_lasso <- lasso_roc$specificities[which.max(lasso_roc$sensitivities + lasso_roc$specificities - 1)]

# Print optimal cutoffs
cat("Optimal cutoff for Full Model:", optimal_cutoff_full, "\n")
Optimal cutoff for Full Model: 0.714215 
cat("Optimal cutoff for Stepwise AIC Model:", optimal_cutoff_stepwise, "\n")
Optimal cutoff for Stepwise AIC Model: 0.7191679 
cat("Optimal cutoff for BIC Model:", optimal_cutoff_bic, "\n")
Optimal cutoff for BIC Model: 0.7209014 
cat("Optimal cutoff for Lasso Model:", optimal_cutoff_lasso, "\n")
Optimal cutoff for Lasso Model: 0.7013373 
# Plot all ROC curves
ggplot() +
  geom_line(aes(x = full_roc$specificities, y = full_roc$sensitivities, color = "Full Model")) +
  geom_line(aes(x = stepwise_roc$specificities, y = stepwise_roc$sensitivities, color = "Stepwise AIC")) +
  geom_line(aes(x = bic_roc$specificities, y = bic_roc$sensitivities, color = "BIC Model")) +
  geom_line(aes(x = lasso_roc$specificities, y = lasso_roc$sensitivities, color = "Lasso Model")) +
  labs(title = "ROC Curve Comparison", x = "1 - Specificity", y = "Sensitivity") +
  scale_color_manual(values = c("red", "blue", "green", "purple", "orange")) +
  theme_minimal()

# AUC values
auc_values <- data.frame(
  Model = c("Full Model", "Stepwise AIC", "BIC Model", "Lasso Model"),
  AUC = c(auc(full_roc), auc(stepwise_roc), auc(bic_roc), auc(lasso_roc))
)

print(auc_values)
         Model       AUC
1   Full Model 0.7509937
2 Stepwise AIC 0.7507281
3    BIC Model 0.7496650
4  Lasso Model 0.7508550

All models have very similar AUC. In turn, the more parsimonious model, BIC, is recommended.

9 CONCLUSION

In this analysis, a comprehensive approach was taken to explore and model the NHANES dataset. Beginning with Exploratory Data Analysis (EDA), patterns and missing values were identified, leading to the implementation of Multiple Imputation by Chained Equations (MICE) to handle missing data. Feature engineering and transformation were applied to enhance model performance, followed by feature selection to refine predictor variables. Both linear and logistic regression models were developed and evaluated, including full models, stepwise selection methods, and regularized regression techniques. Ultimately, the Bayesian Information Criterion (BIC) emerged as the best-performing model for both linear and logistic regression, demonstrating its effectiveness in balancing model complexity and predictive accuracy.

---
title: 'Analysis of The National Health and Nutrition Examination Survey (NHANES)'
author: "Joanne Musa"
date: "March 5, 2025"
output: 
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
  font-weight: bold;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
  font-weight: bold;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
    font-weight: bold;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
    font-weight: bold;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
    font-weight: bold;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 16px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
    font-weight: bold;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.

if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}

if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("leaflet")) {
   install.packages("leaflet")
   library(leaflet)
}

if (!require("ggridges")) {
   install.packages("ggridges")
library(ggridges)
}
if (!require("tidyverse")) {
   install.packages("tidyverse")
library(tidyverse)
}

if (!require("ggplot2")) {
   install.packages("ggplot2")
   library(ggplot2)
}
if (!require("nnet")) {
   install.packages("nnet")
   library(nnet)
}
 
if (!require("nortest")) {
   install.packages("nortest")
   library(nortest)
} 

if (!require("corrplot")) {
   install.packages("corrplot")
   library(corrplot)
}

if (!require("VIM")) {
   install.packages("VIM")
   library(VIM)
}


if (!require("mice")) {
   install.packages("mice")
   library(mice)
}

if (!require("naniar")) {
   install.packages("naniar")
   library(naniar)

}

if (!require("GGally")) {
   install.packages("GGally")
   library(GGally)

}

if (!require("glmnet")) {
   install.packages("glmnet")
   library(glmnet)

}

if (!require("caret")) {
   install.packages("caret")
   library(caret)

}

if (!require("pROC")) {
   install.packages("pROC")
   library(pROC)

}


knitr::opts_chunk$set(
                      echo = TRUE,   
                      warning = FALSE,  
                      result = TRUE,    
                      message = FALSE,
                      comment = NA
                      )  
```



# INTRODUCTION


```{r}
nhanes <- read.csv("https://jmusa3.github.io/jmusa/nhanes.csv")

#CREATE MISSING VALUES in some variables

# Set seed for reproducibility
set.seed(123)  

# Determine the number of missing values (15% of total rows)
num_missing <- round(0.14 * nrow(nhanes))

# Randomly select row indices to introduce missing values 
missing_indices <- sample(1:nrow(nhanes), num_missing, replace = FALSE)

# Set selected rows in body.weight to NA
nhanes$body.weight[missing_indices] <- NA

# Verify missing values
#sum(is.na(nhanes$body.weight))  


# Set seed for reproducibility
set.seed(456)  

# Determine the number of missing values (15% of total rows)
num_missing <- round(0.15 * nrow(nhanes))

# Randomly select row indices to introduce missing values
missing_indices <- sample(1:nrow(nhanes), num_missing, replace = FALSE)

# Set selected rows in body.weight to NA
nhanes$cholestrol[missing_indices] <- NA


# Set seed for reproducibility
set.seed(789)  

# Determine the number of missing values (15% of total rows)
num_missing <- round(0.15 * nrow(nhanes))

# Randomly select row indices to introduce missing values
missing_indices <- sample(1:nrow(nhanes), num_missing, replace = FALSE)

# Set selected rows in race to NA
nhanes$race[missing_indices] <- NA

# Verify missing values

#sum(is.na(nhanes$race))  
#sum(is.na(nhanes$cholestrol)) 
#sum(is.na(nhanes$body.weight)) 

# create a copy of nhanes for MICE
nhanes1 <- nhanes
nhanes1$race <- as.factor(nhanes1$race)
nhanes1$cholestrol <- as.numeric(nhanes1$cholestrol)



```


The National Health and Nutrition Examination Survey (NHANES) is a program conducted by the National Center for Health Statistics (NCHS), a division of the Centers for Disease Control and Prevention (CDC). NHANES has been conducted in various forms since the early 1960s, with the current continuous survey format beginning in 1999. The survey is designed to assess the health and nutritional status of adults and children in the United States through a combination of interviews, physical examinations, and laboratory tests. Data is collected from a nationally representative sample of the U.S. population using a complex, multistage probability sampling design. Participants undergo household interviews followed by health examinations at mobile examination centers (MECs), where trained medical professionals collect biological samples and conduct clinical assessments. The NHANES data provides critical insights into public health trends, helping policymakers, researchers, and healthcare professionals make informed decisions on national health priorities. The purpose of this study is to assess the health and nutritional status of the study participants with metrics such as high blood pressure. 

There are 15 features and 7927 observations in this data set. These features are listed below.

1) obs: Respondent Identification Number
2) psu: Pseudo-PSU 1,2 psu
3) stratum: Pseudo-stratum (01 - 49)
4) stat.weight: Statistical weight (225.93 - 139744.9)
5) age: Age (years) 
6) sex: 0 = Female, 1 = Male sex
7) race: Race (1 = White; 2 = Black; 3 = Other)
8) body.weight: Body Weight (pounds)
9) height: Standing Height (inches)
10) avg.sbp: Average Systolic BP (mm/Hg) 
11) avg.dbp: Average Diastolic BP (mm/Hg) 
12) past.smk: Has respondent smoked > 100 cigarettes in life (1 = Yes, 2 = No)
13) current.smk: Does respondent smoke cigarettes now? (1 = Yes, 2 = No)
14) cholestrol: Serum Cholesterol (mg/100ml) 
15) hbp: High Blood Pressure (0 if PEPMNK1R <= 140; 1 if PEPMNK1R > 140)

The obs, psu, stratum, and stat.weight variables are not included in this analysis because they do not directly provide health-related information. 

The data set has 317 missing values for body weight and 396 missing values for cholesterol. 

# EDA


## Distribution of Individual Features

The distributions of the individual features in the NHANES dataset provide a snapshot of the characteristics of the population in terms of various health and demographic variables. Understanding these distributions is essential for identifying patterns, detecting potential biases, and assessing the range and variability of each feature. This analysis helps in understanding how variables such as age, body weight, blood pressure, and lifestyle factors are spread across the dataset, which is crucial for subsequent statistical modeling and interpretation.

```{r}

# Convert variables to factors with proper labels
nhanes$sex <- factor(nhanes$sex, levels = c(0, 1), labels = c("Female", "Male"))
nhanes$race <- factor(nhanes$race, levels = c(1, 2, 3), labels = c("White", "Black", "Other"))
nhanes$past.smk <- factor(nhanes$past.smk, levels = c(1, 2), labels = c("Yes", "No"))
nhanes$current.smk <- factor(nhanes$current.smk, levels = c(1, 2), labels = c("Yes", "No"))

# Create frequency tables
sex_count <- table(nhanes$sex)

race_count <- table(nhanes$race)

past.smk_count <- table(nhanes$past.smk)

current.smk_count <- table(nhanes$current.smk)


par(mfrow = c(2, 2))

# Plot the bar charts
barplot(height = sex_count, col = "steelblue", main = "Sex", xlab = "Sex", ylab = "Count")

barplot(height = race_count, col = "steelblue", main = "Race", xlab = "Race", ylab = "Count")

barplot(height = past.smk_count, col = "steelblue", main = "Past Smoker", xlab = "Past Smoker", ylab = "Count")

barplot(height = current.smk_count, col = "steelblue", main = "Current Smoker", xlab = "Current Smoker", ylab = "Count")

```

There are slightly more male observations than female observation in this data set. The majority of participants are white, followed by black. All participants are recorded as being past smokers, which is likely an error. There is an imbalance for past smoker. In turn, past smoker cannot be used for analysis. About half of current participants are current smokers. 

```{r}

par(mfrow = c(2, 2))  

# Histogram for Cholesterol
hist(nhanes$cholestrol, 
     main = "Histogram of Cholesterol", 
     xlab = "Cholesterol", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)  # Control number of bins
    
     
# Histogram for Systolic BP
hist(nhanes$avg.sbp, 
     main = "Histogram of Systolic BP", 
     xlab = "Systolic BP", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)
    

# Histogram for Body Weight
hist(nhanes$body.weight, 
     main = "Histogram of Body Weight", 
     xlab = "Body Weight", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)
  
     
# Histogram for Height
hist(nhanes$body.weight, 
     main = "Histogram of Height", 
     xlab = "Height", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)   
     
```

Most participants do not have high blood pressure. The age of participants varies from about 20 to about 90. The height distribution is roughly symmetric. The body weight distribution is skewed right due to some high outliers. The high outliers may indicate obesity. 



```{r}

par(mfrow = c(2,2))

hist(nhanes$avg.sbp, col = "steelblue", 
     main = "Systolic BP",
     xlab = "Systolic BP", ylab = "Frequency",
     breaks = 30)

hist(nhanes$avg.dbp, col = "steelblue", 
     main = "Diastolic BP",
     xlab = "Diastolic BP", ylab = "Frequency",
     breaks = 30)

hist(nhanes$cholestrol, col = "steelblue", 
     main = "Cholesterol",
     xlab = "Cholesterol", ylab = "Frequency",
     breaks = 30)


```

Systolic BP and cholesterol have skewed right distributions. Diastolic BP is roughly symmetric. Of the 7927 participants, 1964 are obese. 


## Relationship between Features

Examining relationships between variables using scatterplots and a correlation matrix provides a comprehensive approach to understanding associations. Scatterplots visually reveal how two continuous variables interact, allowing us to identify trends, correlations, and potential outliers. Meanwhile, a correlation matrix quantifies the strength and direction of relationships between multiple variables at once, helping to pinpoint which pairs are strongly correlated. Together, scatterplots and a correlation matrix offer valuable insights into the underlying patterns in the data, guiding further statistical analysis and model selection.

```{r}
# Set seed for reproducibility
set.seed(123)

# Sample 500 rows (adjust the number as needed)
sample_indices <- sample(nrow(nhanes), size = 500, replace = TRUE)
sampled_data <- nhanes[sample_indices, c("age", "body.weight", "height", "avg.sbp", "avg.dbp", "cholestrol")]

# Create a pairwise scatterplot matrix with the sampled data
pairs(sampled_data, 
      main = "Pairwise Scatterplots (Bootstrapped Sample)",
      pch = 19,           # Shape of the points (circle)
      col = rgb(0, 0, 1, 0.5)  # Color with transparency (blue)
)

```



```{r}

nhanes_subset <- nhanes[, c("age", "body.weight", "height", "avg.sbp", "avg.dbp", "cholestrol")]

# Compute the correlation matrix
cor_matrix <- cor(nhanes_subset, use = "pairwise.complete.obs")

# Print the correlation matrix
print(cor_matrix)

corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.cex = 0.8)


```

The most highly correlate features are age and average systolic blood pressure, body weight and height, average systolic blood pressure and average diastolic blood pressure, age and cholesterol, body weight and cholesterol, and body weigh and average diastolic blood pressure. The pairwise relationships are displayed with individual scatter plots. 

```{r}
# Set seed for reproducibility
set.seed(123)

# Bootstrap a smaller sample (adjust size as needed)
sample_indices <- sample(nrow(nhanes), size = 500, replace = TRUE)
sampled_data <- nhanes[sample_indices, ]

# Set up a 2x3 plotting grid
par(mfrow = c(2, 3))

# Scatterplot for Age and Avg. Systolic Blood Pressure
plot(sampled_data$age, sampled_data$avg.sbp, main = "Age vs Avg. Systolic BP", 
     xlab = "Age", ylab = "Avg. Systolic BP", col = "blue", pch = 19)

# Scatterplot for Body Weight and Height
plot(sampled_data$body.weight, sampled_data$height, main = "Body Weight vs Height", 
     xlab = "Body Weight", ylab = "Height", col = "red", pch = 19)

# Scatterplot for Avg. Systolic Blood Pressure and Avg. Diastolic Blood Pressure
plot(sampled_data$avg.sbp, sampled_data$avg.dbp, main = "Avg. Systolic vs Avg. Diastolic BP", 
     xlab = "Avg. Systolic BP", ylab = "Avg. Diastolic BP", col = "forestgreen", pch = 19)

# Scatterplot for Age and Cholesterol
plot(sampled_data$age, sampled_data$cholestrol, main = "Age vs Cholesterol", 
     xlab = "Age", ylab = "Cholesterol", col = "purple", pch = 19)

# Scatterplot for Avg. Systolic BP and Cholesterol
plot(sampled_data$avg.sbp, sampled_data$cholestrol, main = "Avg Systolic BP vs Cholesterol", 
     xlab = "Cholesterol", ylab = "Avg.SBP", col = "gold", pch = 19)

# Scatterplot for Body Weight and Avg. Diastolic Blood Pressure
plot(sampled_data$body.weight, sampled_data$avg.dbp, main = "Body Weight vs Avg. Diastolic BP", 
     xlab = "Body Weight", ylab = "Avg. Diastolic BP", col = "maroon", pch = 19)
```

Six of the most correlated pairwise relationship are visualized in scatter plots. All pairwise relationships are linear with no apparent clustering. 


```{r}

# Set up a 2x2 plotting grid
par(mfrow = c(2, 2))

# Create a contingency table
sex_hbp_table <- table(nhanes$sex, nhanes$hbp)

# Generate mosaic plot
mosaicplot(sex_hbp_table, 
           col = c("gold", "royalblue"), 
           main = "Sex and High Blood Pressure",
           xlab = "Sex Status", 
           ylab = "HBP Status",
           border = "black")


# Create a contingency table
race_hbp_table <- table(nhanes$race, nhanes$hbp)

# Generate mosaic plot
mosaicplot(race_hbp_table, 
           col = c("gold", "royalblue"), 
           main = "Race and High Blood Pressure",
           xlab = "Race", 
           ylab = "HBP Status",
           border = "black")


# Create a contingency table
current.smk_hbp_table <- table(nhanes$current.smk, nhanes$hbp)

# Generate mosaic plot
mosaicplot(current.smk_hbp_table, 
           col = c("gold", "royalblue"), 
           main = "Current Smoking Status and HBP",
           xlab = "Current Smoking Status", 
           ylab = "HBP Status",
           border = "black")


# Create a contingency table
past.smk_hbp_table <- table(nhanes$past.smk, nhanes$hbp)

# Generate mosaic plot
mosaicplot(past.smk_hbp_table, 
           col = c("gold", "royalblue"), 
           main = "Past Smoking Status and HBP",
           xlab = "Past Smoking Status", 
           ylab = "HBP Status",
           border = "black")
```

Relationship between high blood pressure and race, sex, and current smoking status can be visualized in the mosaic plots. There appears to be an association between high current smoking and race with high blood pressure. A less pronounced relationship can been seen between sex and race with high blood pressure. 


```{r}

# Set up a 2x2 plotting grid
par(mfrow = c(1, 2))


# Create a contingency table
current.smk_race_table <- table(nhanes$race, nhanes$current.smk)

# Generate mosaic plot
mosaicplot(current.smk_race_table, 
           col = c("gold", "royalblue"), 
           main = "Race and Current Smoke Status",
           xlab = "Race", 
           ylab = "Current Smoke",
           border = "black")


# Create a contingency table
current.smk_sex_table <- table(nhanes$sex, nhanes$current.smk)

# Generate mosaic plot
mosaicplot(current.smk_sex_table, 
           col = c("gold", "royalblue"), 
           main = "Sex and Current Smoke Status",
           xlab = "Sex", 
           ylab = "Current Smoke",
           border = "black")

# Create a contingency table
past.smk_race_table <- table(nhanes$race, nhanes$past.smk)



```

An association exists between race and current smoking. Black individuals tend to smoke more than white or other individuals. An association exists between sex and current smoke status. Female tend to smoke currently slight more than males in this population. 



# MISSING VALUES

## Imputation for Categorical Features

The distribution of missing values for is examined to determine any patterns. This step is done to ensure that value are missing at random. 

```{r}
# Set up a 2x2 plotting grid
par(mfrow = c(2, 2))

#sum(is.na(nhanes$race))

#mean(is.na(nhanes$race))

#table(nhanes$race, useNA = "ifany")


#ggplot(nhanes, aes(x = factor(race))) +
  #geom_bar(fill = "blue") +
  #labs(title = "Distribution of Race (Before Imputation)", x = "Race", y = "Count") +
  #theme_minimal()


aggr(nhanes, col = c("navyblue", "red"), numbers = TRUE, sortVars = TRUE, 
     labels = names(nhanes), cex.axis = 0.7, gap = 2, 
     ylab = c("Missing Data Overview", "Pattern of Missing Data"))

```

A pattern of missing values exists between body weight, BMI, and obese. This pattern can be explained by the fact that body weight is used to calculate BMI, which is then used to classify observations for obese. However, the pattern of missing value appears to be random for the other features. 

No pattern of missing values exists between race and other features, In turn, kNN imputation is used for missing values of race.  



```{r}

sum(is.na(nhanes$race))
table(nhanes$race, useNA = "ifany")  # Check distribution including NAs


# Perform KNN Imputation (single imputation)
nhanes_imputed <- kNN(nhanes, variable = "race", k = 3)

# Keep only the original columns (without extra columns added by kNN)
nhanes_imputed <- nhanes_imputed[, names(nhanes)]



# COMPARE before and after imputation distribution

par(mfrow = c(1,2))  # Split plotting area into two

# Convert race to a factor for better visualization
nhanes$race <- as.factor(nhanes$race)
nhanes_imputed$race <- as.factor(nhanes_imputed$race)

# Create a new column to indicate whether the data is original or imputed
nhanes$source <- "Before Imputation"
nhanes_imputed$source <- "After Imputation"

# Combine both datasets
nhanes_compare <- bind_rows(nhanes, nhanes_imputed)

# Plot the distribution before and after imputation
ggplot(nhanes_compare, aes(x = race, fill = source)) +
  geom_bar(position = "dodge") +
  labs(title = "Distribution of Race Before and After Imputation",
       x = "Race",
       y = "Count",
       fill = "Dataset") +
  theme_minimal()

imputed_nhanes <- nhanes_imputed


```
The overall distribution of race after imputation changed slightly after imputation.


## Regression-Based Imputation for Numerical Feature

The pattern of missing values for body weight and cholesterol is examined for any potential patterns.

```{r}


#vis_miss(nhanes[, c("body.weight", "cholestrol")])


# Set smaller font size for the plot
par(cex = 0.4)  # Adjust this value for desired font size

# Generate the missing data pattern plot

md.pattern(imputed_nhanes)

# Reset graphics settings (optional)
par(cex = 1)


```
A pattern of missing values exists between body weight, BMI, and obese because


Body weight is correlated with height, average diastolic blood pressure, and sex. In turn, the missing body weight values will be imputed using a regression model using said explanatory features. 

```{r}

# Subset nhanes to contain only complete observations for selected variables
complete_nhanes <- imputed_nhanes[complete.cases(imputed_nhanes[, c("body.weight", "height", "avg.dbp", "sex")]), ]

# Fit the linear regression model
model_body.weight <- lm(body.weight ~ height + avg.dbp + sex, data = complete_nhanes)

# View the model summary
summary(model_body.weight)

# Plot the residuals
plot(model_body.weight$residuals)


```


The residual plot shows no patterns. In turn, the model fit the data well and the assumption of constant variance of residuals is satisfied.

```{r}

# Create a copy of the dataset to work with
#imputed_nhanes <- complete_nhanes

colSums(is.na(imputed_nhanes))
# Identify the rows with missing values for body.weight
missing_body_weight <- is.na(imputed_nhanes$body.weight)

# Use the model to predict the missing values
imputed_nhanes$body.weight[missing_body_weight] <- predict(model_body.weight, 
                                                           newdata = imputed_nhanes[missing_body_weight, c("height", "avg.dbp", "sex", "age")])

# Check the dataset
#head(imputed_nhanes)

# Create a plot to overlay the density curves
ggplot() +
  geom_density(data = complete_nhanes, aes(x = body.weight), fill = "blue", alpha = 0.2, na.rm = TRUE) + 
  geom_density(data = imputed_nhanes, aes(x = body.weight), fill = "red", alpha = 0.2, na.rm = TRUE) +
  labs(title = "Density Plot of Body Weight: Before and After Imputation",
       x = "Body Weight", y = "Density") +
  theme_minimal() +
  scale_fill_manual(name = "Body Weight", values = c("blue" = "blue", "red" = "red"), 
                    labels = c("Before Imputation", "After Imputation"))


```


The distribution before and after imputation is similar. In turn, the imputation did not change the distribution of body weight. 




The features that are most correlated to cholesterol are age, average systolic blood pressure, and body weight. In turn, the missing values of cholesterol will be imputed with a model using these features. 

```{r}

# Subset nhanes to include only rows with complete cases for cholestrol, age, avg.sbp, and BMI
complete_nhanes <- imputed_nhanes[!is.na(imputed_nhanes$cholestrol) & !is.na(imputed_nhanes$age) & 
                           !is.na(imputed_nhanes$avg.sbp) & !is.na(imputed_nhanes$body.weight), ]

complete_nhanes <- imputed_nhanes[complete.cases(imputed_nhanes[, c("cholestrol", "age", "avg.sbp", "body.weight")]), ]

# Fit the linear regression model to predict cholestrol
model_cholestrol <- lm(cholestrol ~ age + avg.sbp + body.weight, data = complete_nhanes)
summary(model_cholestrol)


# View the model summary
summary(model_cholestrol)

# Create a copy of nhanes for imputation
#imputed_nhanes <- nhanes

# Identify rows with missing cholestrol values
missing_cholestrol <- is.na(imputed_nhanes$cholestrol)

# Use the model to predict the missing cholestrol values
imputed_nhanes$cholestrol[missing_cholestrol] <- predict(model_cholestrol, 
                                                          newdata = imputed_nhanes[missing_cholestrol, c("age", "avg.sbp", "body.weight")])


```

All explanatory features are significant. 

```{r}
# Get residuals from the model for complete_nhanes
residuals_cholestrol <- residuals(model_cholestrol)

# Predicted values for complete_nhanes
predicted_cholestrol <- predict(model_cholestrol, newdata = complete_nhanes)

# Create a data frame for residuals and predicted values
residuals_data <- data.frame(
  predicted_cholestrol = predicted_cholestrol,
  residuals_cholestrol = residuals_cholestrol
)

# Create the residual plot
library(ggplot2)
ggplot(residuals_data, aes(x = predicted_cholestrol, y = residuals_cholestrol)) +
  geom_point(color = "blue") +  # Plot the residuals
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +  # Add a horizontal line at 0
  labs(title = "Residual Plot for Cholestrol Prediction",
       x = "Predicted Cholestrol",
       y = "Residuals") +
  theme_minimal()

# Create the overlay density plot
ggplot() +
  geom_density(data = nhanes, aes(x = cholestrol), fill = "blue", alpha = 0.2, na.rm = TRUE) +  # Before imputation
  geom_density(data = imputed_nhanes, aes(x = cholestrol), fill = "red", alpha = 0.2, na.rm = TRUE) +  # After imputation
  labs(title = "Density Plot of Cholestrol: Before and After Imputation",
       x = "Cholestrol", y = "Density") +
  theme_minimal() +
  scale_fill_manual(name = "Cholestrol", values = c("blue" = "blue", "red" = "red"), 
                    labels = c("Before Imputation", "After Imputation"))


```

The distribution of cholesterol did not change significantly after imputation. As well, the residual plot does not display any patterns. In turn, the model fit the data well and the assumption of constant variance of residuals is satisfied. 

## Feature Engineer

In turn a feature which calculates BMI is created to determine observations are classified as obese. According to the CDC (https://www.cdc.gov/nccdphp/dnpao/data-trends-maps/help/npao_dtm/definitions.html), Adult obesity is defined as body mass index (BMI) ≥ 30.0. Another feature named obese, based on BMI, is created, where 0 = no and 1 = yes. 

```{r}

# Create BMI feature
imputed_nhanes$BMI <- (imputed_nhanes$body.weight * 703) / (imputed_nhanes$height * imputed_nhanes$height)
# Round the BMI to the nearest ones place
imputed_nhanes$BMI <- round(imputed_nhanes$BMI, 0)

# Create the obese feature based on BMI
imputed_nhanes$obese <- ifelse(imputed_nhanes$BMI >= 30, 1, 0)

```

```{r}

par(mfrow = c(1, 2))

# Histogram for BMI

hist(imputed_nhanes$BMI, 
     main = "Histogram of BMI", 
     xlab = "BMI", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)

# bar chart for obese
nhanes$obese <- factor(imputed_nhanes$obese, levels = c(0, 1), labels = c("No", "Yes"))

obese_count <- table(imputed_nhanes$obese)

barplot(height = obese_count, col = "steelblue", main = "Obesity", xlab = "Obesity", ylab = "Count")


```


```{r}
# Set up a 2x2 plotting grid
par(mfrow = c(2, 2))


# Create a contingency table
obese_hbp_table <- table(imputed_nhanes$obese, imputed_nhanes$hbp)

# Generate mosaic plot
mosaicplot(obese_hbp_table, 
           col = c("gold", "royalblue"), 
           main = "Obesity and High Blood Pressure",
           xlab = "Obesity Status", 
           ylab = "HBP Status",
           border = "black")

# Create a contingency table
obese_sex_table <- table(imputed_nhanes$sex, imputed_nhanes$obese)

# Generate mosaic plot
mosaicplot(obese_sex_table, 
           col = c("gold", "royalblue"), 
           main = "Sex and Obesity",
           xlab = "Obesity Status", 
           ylab = "Sex",
           border = "black")


# Create a contingency table
obese_race_table <- table(imputed_nhanes$race, imputed_nhanes$obese)

# Generate mosaic plot
mosaicplot(obese_race_table, 
           col = c("gold", "royalblue"), 
           main = "Race and Obesity",
           xlab = "Obesity Status", 
           ylab = "Race",
           border = "black")


# Create a contingency table
obese_smoke_table <- table(imputed_nhanes$current.smk, imputed_nhanes$obese)

# Generate mosaic plot
mosaicplot(obese_smoke_table, 
           col = c("gold", "royalblue"), 
           main = "Current Smoking Status and Obesity",
           xlab = "Obesity Status", 
           ylab = "Current Smoking Status",
           border = "black")


```





## Multiple Imputation

Multiple Imputation by Chained Equations (MICE) is a powerful technique used to handle missing data by creating multiple plausible imputations for each missing value based on relationships in the observed data. For body weight, cholesterol, and race, MICE can be applied to generate accurate imputations, accounting for correlations between these variables and other features in the dataset. This method allows for more reliable statistical analyses by reducing bias and preserving variability in the imputed values, making it especially useful for handling missing data in incoming new data sets.

```{r}

nhanes1$race <- as.factor(nhanes1$race)  
nhanes1$cholestrol <- as.numeric(nhanes1$cholestrol)
nhanes1$body.weight <- as.numeric(nhanes1$body.weight)

nhanes2 <- nhanes1


# 1 initialization

init <- mice(nhanes2, maxit=0)
print(init$method)


# 2 imputation models

imp <- mice(nhanes2, method = c("", "", "", "", "", "", "", "polyreg", "pmm", "", "", "", "", "", "pmm", ""), 
            maxit=10,
            m=1,
            seed = 123,
            print=F)

complete(imp, action = 1)[1:10,]


# 3 multiple (iterative) imputations

imp5 <- mice(nhanes2, method = "pmm", m = 5, maxit = 10, seed = 123, print=F)
plot(imp5)



```


Multiple Imputation by Chained Equations (MICE) will be applied to handle missing values in the body weight variable. To assess the quality of the imputations, the Mean Squared Error (MSE) will be calculated and compared the MSE of the single imputation conducted for body weight earlier in this analysis, providing a measure of the imputation accuracy.


```{r}
model5_bodyweight <- with(imp5, lm(body.weight ~ height + avg.dbp + sex))
summary.stats = summary(model5_bodyweight)

summary(pool(model5_bodyweight))

beta = summary.stats$estimate[seq(1,15,by=3)]  # explicit vector: c(1,4,7,10,13)
beta.var = (summary.stats$std.error[seq(1,15,by=3)])^2 
Q = mean(beta)
U = mean(beta.var)
B = var(beta)
T = U + (6/5)*B
pool.se = sqrt(T)
cbind(pool.se.intercept = pool.se)


```
The model can be used to predict body weight for incoming new data. 

The MSE for single body weight imputation used earlier in this analysis is MSE = (RSE)^2 = (33.52)^2 = 1123.6 which is substantially higher than MSE when using MICE. In turn, MICE is recommended in this analysis. 



Multiple Imputation by Chained Equations (MICE) will be applied to handle missing values in the cholesterol variable. To assess the quality of the imputations, the Mean Squared Error (MSE) will be calculated and compared the MSE of the single imputation conducted for cholesterol earlier in this analysis, providing a measure of the imputation accuracy.

```{r}
model5_chol <- with(imp5, lm(cholestrol ~ age + avg.sbp))
summary.stats = summary(model5_chol)


summary(pool(model5_chol))

beta = summary.stats$estimate[seq(1,15,by=3)]  # explicit vector: c(1,4,7,10,13)
beta.var = (summary.stats$std.error[seq(1,15,by=3)])^2 
Q = mean(beta)
U = mean(beta.var)
B = var(beta)
T = U + (6/5)*B
pool.se = sqrt(T)
cbind(pool.se.intercept = pool.se)


```
The model can be used to predict cholesterol for incoming new data. 

The Mean Squared Error (MSE) from the Residual Standard Error (RSE) given in the model output from the cholesterol single imputation method used early in this analysis is MSE = (42.59)^2 = 1813.9. Thus, the Mean Squared Error (MSE) = 1813.9.

The MSE using MICE is 3.66 which is substantially less. Therefore, MICE is recommended in the analysis of this data set. 



For the missing values in the race variable, Multiple Imputation by Chained Equations (MICE) will be applied using multinomial logistic regression. This method will account for the categorical nature of the race variable, allowing for appropriate imputation based on the relationships with other features in the dataset.

```{r}

model_race <- with(imp5, multinom(race ~ avg.dbp + avg.sbp + cholestrol + hbp))

# Pool the results from the imputed datasets
pooled_race <- pool(model_race)

# Get a summary of the pooled results
summary(pooled_race)

```

The model can be used to predict race for incoming new data. 


# FEATURE TRANSFORMING

Based on the distribution patterns observed in the NHANES data set, several feature engineering tasks can be performed to enhance the data’s usability for modeling. First, for age, which is roughly uniform, no major transformation is needed, though we may consider binning it into age groups if it helps with interpretability. For weight and systolic blood pressure (BP), which are both skewed right, a log transformation could be applied to reduce skewness and make these variables more normally distributed, improving model performance. Similarly, for cholesterol, which is slightly skewed right, a log transformation may also be useful. For diastolic BP, which is roughly symmetric, no transformation is necessary, but checking for outliers could be helpful. These transformations aim to stabilize variance, reduce the impact of extreme values, and create features that are more suitable for modeling with techniques like linear regression or other statistical methods.


```{r}
#Log transformation to reduce skew

imputed_nhanes$log_body.weight <- log(imputed_nhanes$body.weight)
imputed_nhanes$log_avg.sbp <- log(imputed_nhanes$avg.sbp)
imputed_nhanes$log_cholestrol <- log(imputed_nhanes$cholestrol)
imputed_nhanes$log_BMI <- log(imputed_nhanes$BMI)




# Set up the plotting area to show 2x2 grid of histograms
par(mfrow = c(2, 2))  # This will create a 2x2 grid of plots

# Histogram for Log-Cholesterol
hist(imputed_nhanes$log_cholestrol, 
     main = "Histogram of Log(Cholesterol)", 
     xlab = "Log(Cholesterol)", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)  
    
# Histogram for Log-Body Weight
hist(imputed_nhanes$log_body.weight, 
     main = "Histogram of Log(Body Weight)", 
     xlab = "Log(Body Weight)", 
     col = "steelblue", 
     border = "black", 
     breaks = 20) 


# Histogram for Log-Systolic BP
hist(imputed_nhanes$log_avg.sbp, 
     main = "Histogram of Log(Systolic BP)", 
     xlab = "Log(Systolic BP)", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)
   

# Histogram for Log-BMI
hist(imputed_nhanes$log_BMI, 
     main = "Histogram of Log(BMI)", 
     xlab = "Log(BMI)", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)


```

The distributions of these previously skewed features are more symmetric after log transformations. 


# STANDARDIZING

Standardizing features ensures that variables with different scales, such as age, BMI, height, and cholesterol, contribute equally to the analysis. Without standardization, variables with larger ranges could dominate model performance, leading to biased results. By standardizing these features, we improve model accuracy, enable better comparison between variables, and enhance the efficiency of machine learning algorithms, particularly in models that rely on distance-based or gradient-based methods.


```{r}

# Specify the columns to be standardized
columns_to_standardize <- c("age", "BMI", "height", "body.weight", "cholestrol", "avg.dbp", "avg.sbp")

# Standardize the selected columns
imputed_nhanes[columns_to_standardize] <- scale(imputed_nhanes[columns_to_standardize])

  
```





# FEATURE SELECTION

Feature selection is a critical step in model building, as it helps improve model performance by identifying the most relevant predictors and reducing overfitting. In this analysis, both model-based selection and regularization-based selection techniques will be used to assess and retain the most influential features for accurate prediction, ensuring a more efficient and interpretable model.


A full model is created to predict cholesterol.

```{r}

#sapply(imputed_nhanes, function(x) length(unique(x)))

#head(imputed_nhanes, 25)


#head(imputed_nhanes)


model_full <- lm(cholestrol ~ age + race + BMI + height + body.weight + 
                  hbp + avg.dbp + avg.sbp + obese, 
                  data = imputed_nhanes)

# Summary of the full model
#summary(model_full)

# Tidy the model summary
model_summary <- tidy(model_full)

# Calculate Mean Squared Error (MSE)
mse <- mean(model_full$residuals^2)

# Calculate R-squared
r2 <- summary(model_full)$r.squared

# Create a data frame for MSE and R-squared
metrics_df <- data.frame(Statistic = c("Mean Squared Error", "R-squared"), 
                         Value = c(mse, r2))

# Print the model summary using kable
kable(model_summary, digits = 3, caption = "Regression Model Summary")

# Print MSE and R-squared using kable
kable(metrics_df, digits = 3, caption = "Model Performance Metrics")

```

Stepwise model selection is conducted to retain significant predictors. Selection processes based on AIC and BIC are conducted. 

```{r}
# Stepwise model selection based on AIC
model_stepwise_AIC <- step(model_full, direction = "both", trace = 0)

# Stepwise model selection based on BIC
model_stepwise_BIC <- step(model_full, direction = "both", k = log(nrow(imputed_nhanes)), trace = 0)


# Tidy the summaries for both models
stepwise_AIC_summary <- tidy(model_stepwise_AIC)
stepwise_BIC_summary <- tidy(model_stepwise_BIC)

# Compute MSE for both models
mse_AIC <- mean(model_stepwise_AIC$residuals^2)
mse_BIC <- mean(model_stepwise_BIC$residuals^2)

# Compute R-squared for both models
r2_AIC <- summary(model_stepwise_AIC)$r.squared
r2_BIC <- summary(model_stepwise_BIC)$r.squared

# Create data frames for MSE and R-squared
metrics_AIC_df <- data.frame(Statistic = c("Mean Squared Error", "R-squared"), 
                             Value = c(mse_AIC, r2_AIC))

metrics_BIC_df <- data.frame(Statistic = c("Mean Squared Error", "R-squared"), 
                             Value = c(mse_BIC, r2_BIC))

# Print the stepwise AIC model summary
kable(stepwise_AIC_summary, digits = 3, caption = "Stepwise AIC Regression Model Summary")

# Print performance metrics for AIC model
kable(metrics_AIC_df, digits = 3, caption = "Model Performance Metrics (AIC)")

# Print the stepwise BIC model summary
kable(stepwise_BIC_summary, digits = 3, caption = "Stepwise BIC Regression Model Summary")

# Print performance metrics for BIC model
kable(metrics_BIC_df, digits = 3, caption = "Model Performance Metrics (BIC)")
```



```{r}
# Check R-squared value
summary(model_stepwise_AIC)$r.squared
summary(model_stepwise_BIC)$r.squared


```

Both AIC and BIC stepwise model selection processes have similar R squared values. The BIC stepwise model is parsimonious. This model selects age, height, body weight, and average diastolic blood pressure as predictors for cholesterol. 



Regularization-based selection techniques are useful for improving model performance and preventing overfitting, especially when you have many predictors. Two common regularization methods are Ridge regression and Lasso regression.

```{r}

# Extract the response and predictors
y <- imputed_nhanes$cholestrol  # Response variable
X <- as.matrix(imputed_nhanes[, c("age", "BMI", "height", "body.weight", "avg.dbp", "avg.sbp")])

# Optional: Scale predictors (recommended for Ridge and Lasso)
X_scaled <- scale(X)

# Fit Ridge and Lasso models
ridge_model <- glmnet(X_scaled, y, alpha = 0)  # Ridge Regression (alpha = 0)
lasso_model <- glmnet(X_scaled, y, alpha = 1)  # Lasso Regression (alpha = 1)

# Ridge cross-validation
cv_ridge <- cv.glmnet(X_scaled, y, alpha = 0)
best_lambda_ridge <- cv_ridge$lambda.min

# Lasso cross-validation
cv_lasso <- cv.glmnet(X_scaled, y, alpha = 1)
best_lambda_lasso <- cv_lasso$lambda.min

# Extract coefficients for Ridge and Lasso at optimal lambda
ridge_coefficients <- as.data.frame(as.matrix(coef(cv_ridge, s = "lambda.min")))
colnames(ridge_coefficients) <- c("Coefficient")
ridge_coefficients$Predictor <- rownames(ridge_coefficients)

lasso_coefficients <- as.data.frame(as.matrix(coef(cv_lasso, s = "lambda.min")))
colnames(lasso_coefficients) <- c("Coefficient")
lasso_coefficients$Predictor <- rownames(lasso_coefficients)

# Predictions for Ridge
pred_ridge <- predict(cv_ridge, X_scaled, s = "lambda.min")

# Predictions for Lasso
pred_lasso <- predict(cv_lasso, X_scaled, s = "lambda.min")

# Calculate RMSE for Ridge
rmse_ridge <- sqrt(mean((pred_ridge - y)^2))

# Calculate RMSE for Lasso
rmse_lasso <- sqrt(mean((pred_lasso - y)^2))

# Calculate R² for Ridge
rss_ridge <- sum((pred_ridge - y)^2)  # Residual Sum of Squares
tss_ridge <- sum((y - mean(y))^2)    # Total Sum of Squares
r2_ridge <- 1 - rss_ridge / tss_ridge

# Calculate R² for Lasso
rss_lasso <- sum((pred_lasso - y)^2)
tss_lasso <- sum((y - mean(y))^2)
r2_lasso <- 1 - rss_lasso / tss_lasso

# Create a data frame for RMSE and R²
metrics_df <- data.frame(
  Model = c("Ridge", "Lasso"),
  RMSE = c(rmse_ridge, rmse_lasso),
  R2 = c(r2_ridge, r2_lasso)
)

# Print Ridge coefficients using kable
kable(ridge_coefficients, digits = 3, caption = "Ridge Regression Coefficients")

# Print Lasso coefficients using kable
kable(lasso_coefficients, digits = 3, caption = "Lasso Regression Coefficients")

# Print RMSE and R² values using kable
kable(metrics_df, digits = 3, caption = "Model Performance Metrics (RMSE & R²)")
```


The Ridge and Lasso method both had similar R squared values. Therefore, both models are recommended. 

# LINEAR REGRESSION

In this regression analysis of the NHANES dataset, three models are developed using cross validation to predict the outcome variable cholesterol: a full model including all selected predictors, a stepwise BIC model for variable selection, and a Ridge regression model to address multicollinearity. Each model was evaluated based on its predictive performance, with Mean Squared Error (MSE) and R-squared (R^2) used as key comparison metrics. By analyzing these models, the aim is to determine the most accurate and efficient approach for modeling the data while balancing complexity and predictive power.

```{r}

set.seed(42)
train_control <- trainControl(method = "cv", number = 10)

```


## Full Model

The full model includes all selected predictor variables, providing a comprehensive approach to estimating cholesterol levels. To assess its stability, we implement k-fold cross-validation, which systematically partitions the dataset into multiple training and validation subsets. This approach mitigates the risk of overfitting by testing the model’s performance on unseen data, yielding a robust estimate of prediction error. The model’s effectiveness is quantified using the Mean Squared Error (MSE), which measures the average squared difference between observed and predicted cholesterol values. By employing cross-validation, we ensure that the full model is evaluated rigorously, providing insights into its accuracy and reliability in real-world applications.


```{r}

# Set up 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)  # 10-fold CV

# Fit the full model
full_model <- train(cholestrol ~ age + race + log_BMI + height + log_body.weight + 
                    hbp + avg.dbp + log_avg.sbp + obese, 
                    data = imputed_nhanes, 
                    method = "lm", 
                    trControl = train_control)

# Extract coefficients from the final model
coeff_table <- summary(full_model$finalModel)$coefficients %>%
  as.data.frame() %>%
  rownames_to_column(var = "Predictor")  # Convert row names to a column

# Calculate Cross-Validation MSE
mse <- mean(full_model$resample$RMSE^2)

# Calculate R² from the final model
r2 <- summary(full_model$finalModel)$r.squared

# Create a data frame for MSE and R²
metrics_df <- data.frame(
  Statistic = c("Mean Squared Error", "R-squared"),
  Value = c(mse, r2)
)

# Print the coefficients using kable
kable(coeff_table, digits = 4, caption = "Regression Model Coefficients", format = "markdown")

# Print MSE and R² using kable
kable(metrics_df, digits = 4, caption = "Model Performance Metrics (MSE & R²)", format = "markdown")
```



## Stepwise BIC

In this analysis, a Stepwise BIC regression model is implemented to identify the most parsimonious subset of predictors for estimating cholesterol levels. The Stepwise BIC approach selects variables by minimizing the Bayesian Information Criterion (BIC), which penalizes model complexity to prevent overfitting. To assess the model’s generalizability, we employ cross-validation by splitting the dataset into training and testing subsets. The model is trained on the training data, and predictions are made on the test data to compute the Mean Squared Error (MSE) as a measure of model performance. Additionally, the estimated regression coefficients in a structured table for clarity and interpretation. This approach ensures an optimal balance between model complexity and predictive accuracy while validating the model’s robustness.

```{r}

stepwise_bic_model <- function(train_data, test_data) {
  # Fit initial full model
  full_model2 <- lm(cholestrol ~ age + height + log_body.weight + avg.dbp, data = train_data)
  
  # Perform stepwise selection using BIC
  model_stepwise_BIC <- stepAIC(full_model2, direction = "both", k = log(nrow(train_data)), trace = FALSE)
  
  # Extract coefficients
  coeff_table <- summary(model_stepwise_BIC)$coefficients %>%
    as.data.frame() %>%
    rownames_to_column(var = "Predictor")  # Convert row names to a column

  # Print coefficients using kable
  print(kable(coeff_table, digits = 4, caption = "Stepwise BIC Model Coefficients", format = "markdown"))

  # Predict on test data
  predictions <- predict(model_stepwise_BIC, newdata = test_data)
  
  # Compute Mean Squared Error (MSE)
  mse <- mean((test_data$cholestrol - predictions)^2)

  # Compute R²
  rss <- sum((test_data$cholestrol - predictions)^2)  # Residual Sum of Squares
  tss <- sum((test_data$cholestrol - mean(test_data$cholestrol))^2)  # Total Sum of Squares
  r2 <- 1 - rss/tss  # R-squared
  
  # Create a data frame for MSE and R²
  metrics_df <- data.frame(
    Statistic = c("Mean Squared Error", "R-squared"),
    Value = c(mse, r2)
  )

  # Print MSE and R² using kable
  print(kable(metrics_df, digits = 4, caption = "Stepwise BIC Model Performance Metrics (MSE & R²)", format = "markdown"))

  return(list(model = model_stepwise_BIC, mse = mse, r2 = r2))  # Return the model, MSE, and R²
}

# Example usage (Assuming you have a dataset split into train_data and test_data)
set.seed(123)
train_index <- createDataPartition(imputed_nhanes$cholestrol, p = 0.8, list = FALSE)
train_data <- imputed_nhanes[train_index, ]
test_data <- imputed_nhanes[-train_index, ]

# Run the function
stepwise_results <- stepwise_bic_model(train_data, test_data)


```


## Ridge Regression

Ridge regression is applied to the NHANES dataset to address multicollinearity among predictor variables while maintaining model stability. By imposing a penalty on large coefficients through L2 regularization, Ridge regression prevents overfitting and improves generalization. The model was evaluated using cross-validation, and its performance was compared to other regression approaches based on Mean Squared Error (MSE) and R².

```{r}

# Define predictor matrix (X) and response variable (y)
X <- as.matrix(imputed_nhanes[, c("age", "log_BMI", "height", "log_body.weight", "avg.dbp", "log_avg.sbp")])
y <- imputed_nhanes$cholestrol

set.seed(123)  # For reproducibility

# Define training control with 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)

# Train Ridge Regression Model
ridge_model <- train(
  x = X, y = y,
  method = "glmnet",
  trControl = train_control,
  tuneGrid = expand.grid(alpha = 0, lambda = 10^seq(-3, 3, length = 100))  # Ridge: alpha = 0
)

# Best lambda value selected via cross-validation
best_lambda <- ridge_model$bestTune$lambda

# Predict using best lambda
ridge_predictions <- predict(ridge_model, X)

# Compute Mean Squared Error (MSE)
ridge_mse <- mean((y - ridge_predictions)^2)

# Compute R²
rss <- sum((y - ridge_predictions)^2)  # Residual Sum of Squares
tss <- sum((y - mean(y))^2)  # Total Sum of Squares
ridge_r2 <- 1 - (rss / tss)  # R² calculation

# Extract coefficients from the final Ridge model
coefficients <- as.data.frame(as.matrix(ridge_model$finalModel$beta)) %>%
  tibble::rownames_to_column(var = "Predictor")  # Convert row names to a column

#print(kable(head(coefficients, 10), digits = 4, caption = "Top 10 Ridge Regression Coefficients", format = "markdown"))


# Create a data frame for MSE, R², and Best Lambda
metrics_df <- data.frame(
  Statistic = c("Mean Squared Error", "R-squared", "Optimal Lambda"),
  Value = c(ridge_mse, ridge_r2, best_lambda)
)

# Print model performance metrics using kable()
print(kable(metrics_df, digits = 4, caption = "Ridge Regression Performance Metrics", format = "markdown"))
```



## Comparison

After performing cross-validation on the NHANES_imputed dataset, the predictive performance of the thre models are compared. The BIC model has the smallest MSE and highest R^2 value, indicating a better performing regression model. In, turn, the BIC model is recommended. 




# LOGISTIC REGRESSION

This analysis applies logistic regression to predict current smoking status using data from NHANES. Four models are developed: a full model including all relevant predictors, a stepwise AIC model for variable selection, a BIC model for a more parsimonious approach, and a regularized logistic regression model using Lasso to prevent overfitting. The models are evaluated and compared using Receiver Operating Characteristic (ROC) curves and the Area Under the Curve (AUC) to assess their classification performance and predictive accuracy.

## Full Model

The first candidate, a full model, is fitted to predict current smoking status. 

```{r}

# Fit logistic regression model
full_model <- glm(current.smk ~ age + race + log_body.weight + log_BMI + obese + 
                    height + log_avg.sbp + avg.dbp + log_cholestrol + hbp, 
                  data = imputed_nhanes, 
                  family = binomial)

# Extract coefficients from the model
coefficients <- summary(full_model)$coefficients

# Convert coefficients to a data frame and add row names as a column
coeff_table <- as.data.frame(coefficients) %>%
  tibble::rownames_to_column(var = "Predictor")  # Convert row names to a column

# Print the coefficients table using kable
kable(coeff_table, digits = 4, caption = "Logistic Regression Model Coefficients", format = "markdown")



```

Age, race, and average dbp are significant features in the full model. 

## Stepwise AIC Model

The second model, stepwise AIC, is fitted to predict current smoking status. This model automatically selects a subset of predictors using AIC-based stepwise selection. 

```{r}


# Fit initial full model
full_model <- glm(current.smk ~ age + race + log_body.weight + log_BMI + obese + 
                    height + log_avg.sbp + avg.dbp + log_cholestrol + hbp, 
                  data = imputed_nhanes, 
                  family = binomial)

# Perform stepwise selection using AIC
stepwise_model <- stepAIC(full_model, direction = "both", trace = FALSE)

# Extract coefficients from the stepwise model
coefficients_stepwise <- summary(stepwise_model)$coefficients

# Convert coefficients to a data frame and add row names as a column
coeff_table_stepwise <- as.data.frame(coefficients_stepwise) %>%
  tibble::rownames_to_column(var = "Predictor")  # Convert row names to a column

# Print the coefficients table using kable
kable(coeff_table_stepwise, digits = 4, caption = "Stepwise Logistic Regression Model Coefficients", format = "markdown")



```

All features, other than obese, are significant in the stepwise AIC model. 


## BIC Model

The third candidate,a BIC model, is fitted to predict current smoking status. This model fits a more parsimonious model. 

```{r}

# Fit initial full model
full_model <- glm(current.smk ~ age + race + log_body.weight + log_BMI + obese + 
                    height + log_avg.sbp + avg.dbp + log_cholestrol + hbp, 
                  data = imputed_nhanes, 
                  family = binomial)

# Perform stepwise selection using BIC (using k = log(nrow(imputed_nhanes)))
bic_model <- stepAIC(full_model, direction = "both", k = log(nrow(imputed_nhanes)), trace = FALSE)

# Extract coefficients from the BIC model
coefficients_bic <- summary(bic_model)$coefficients

# Convert coefficients to a data frame and add row names as a column
coeff_table_bic <- as.data.frame(coefficients_bic) %>%
  tibble::rownames_to_column(var = "Predictor")  # Convert row names to a column

# Print the coefficients table using kable
kable(coeff_table_bic, digits = 4, caption = "BIC Stepwise Model Coefficients", format = "markdown")


```

All features are significant in the BIC model.


## Regularized Logistic Regression Model (Lasso)

The fourth candidate model uses Regularized Logistic Regression (Lasso). This method selects the most important predictors.

```{r}

# Define predictor matrix (X) and response variable (y)
x <- model.matrix(current.smk ~ age + race + log_body.weight + log_BMI + obese + height + log_avg.sbp + avg.dbp + log_cholestrol + hbp, 
                  data = imputed_nhanes)[, -1]  # Exclude intercept
y <- imputed_nhanes$current.smk

# Train the Lasso model using cross-validation
lasso_model <- cv.glmnet(x, y, family = "binomial", alpha = 1)


# Best lambda value selected via cross-validation
best_lambda_lasso <- lasso_model$lambda.min
print(paste("Optimal Lambda for Lasso:", best_lambda_lasso))

# Extract coefficients from the final Lasso model at the optimal lambda
lasso_coefficients <- coef(lasso_model, s = "lambda.min")

# Convert coefficients to a data frame
lasso_coefficients_df <- as.data.frame(as.matrix(lasso_coefficients))
lasso_coefficients_df$Predictor <- rownames(lasso_coefficients_df)

# Reorder columns to make predictor names the first column
lasso_coefficients_df <- lasso_coefficients_df %>%
  select(Predictor, everything())

# Print the coefficients table using kable
kable(lasso_coefficients_df, digits = 4, caption = "Lasso Regression Coefficients", format = "markdown")


```


## Comparison

In this analysis, four logistic regression models—Full Model, Stepwise AIC Model, BIC Model, and Lasso Model are compared using ROC (Receiver Operating Characteristic) curves and AUC (Area Under the Curve) to assess their predictive performance.  The optimal cutoff for each model using Youden’s Index is determined, which maximizes the balance between sensitivity and specificity. By evaluating these models through both graphical and numerical metrics, insights is gained into which model provides the best classification performance for predicting the outcome variable.

```{r}
# Full Model
full_model <- glm(current.smk ~ age + race + log_body.weight + log_BMI + obese + height + log_avg.sbp + avg.dbp + log_cholestrol + hbp, data = imputed_nhanes, family = binomial)

# Stepwise AIC Model
stepwise_model <- stepAIC(full_model, direction = "both", trace = FALSE)

# BIC Model
bic_model <- stepAIC(full_model, direction = "both", k = log(nrow(imputed_nhanes)), trace = FALSE)

# Lasso Model
x <- model.matrix(current.smk ~ age + race + log_body.weight + log_BMI + obese + height + log_avg.sbp + avg.dbp + log_cholestrol + hbp , data = imputed_nhanes)[, -1]  # Exclude intercept
y <- imputed_nhanes$current.smk
lasso_cv <- cv.glmnet(x, y, family = "binomial", alpha = 1)
lasso_model <- glmnet(x, y, family = "binomial", alpha = 1, lambda = lasso_cv$lambda.min)

# Get predicted probabilities
imputed_nhanes$full_pred <- predict(full_model, type = "response")
imputed_nhanes$stepwise_pred <- predict(stepwise_model, type = "response")
imputed_nhanes$bic_pred <- predict(bic_model, type = "response")
imputed_nhanes$lasso_pred <- predict(lasso_model, newx = x, type = "response")[,1]

# Create ROC curves
full_roc <- roc(imputed_nhanes$current.smk, imputed_nhanes$full_pred)
stepwise_roc <- roc(imputed_nhanes$current.smk, imputed_nhanes$stepwise_pred)
bic_roc <- roc(imputed_nhanes$current.smk, imputed_nhanes$bic_pred)
lasso_roc <- roc(imputed_nhanes$current.smk, imputed_nhanes$lasso_pred)

# Calculate optimal cutoff using Youden's Index
optimal_cutoff_full <- full_roc$specificities[which.max(full_roc$sensitivities + full_roc$specificities - 1)]
optimal_cutoff_stepwise <- stepwise_roc$specificities[which.max(stepwise_roc$sensitivities + stepwise_roc$specificities - 1)]
optimal_cutoff_bic <- bic_roc$specificities[which.max(bic_roc$sensitivities + bic_roc$specificities - 1)]
optimal_cutoff_lasso <- lasso_roc$specificities[which.max(lasso_roc$sensitivities + lasso_roc$specificities - 1)]

# Print optimal cutoffs
cat("Optimal cutoff for Full Model:", optimal_cutoff_full, "\n")
cat("Optimal cutoff for Stepwise AIC Model:", optimal_cutoff_stepwise, "\n")
cat("Optimal cutoff for BIC Model:", optimal_cutoff_bic, "\n")
cat("Optimal cutoff for Lasso Model:", optimal_cutoff_lasso, "\n")

# Plot all ROC curves
ggplot() +
  geom_line(aes(x = full_roc$specificities, y = full_roc$sensitivities, color = "Full Model")) +
  geom_line(aes(x = stepwise_roc$specificities, y = stepwise_roc$sensitivities, color = "Stepwise AIC")) +
  geom_line(aes(x = bic_roc$specificities, y = bic_roc$sensitivities, color = "BIC Model")) +
  geom_line(aes(x = lasso_roc$specificities, y = lasso_roc$sensitivities, color = "Lasso Model")) +
  labs(title = "ROC Curve Comparison", x = "1 - Specificity", y = "Sensitivity") +
  scale_color_manual(values = c("red", "blue", "green", "purple", "orange")) +
  theme_minimal()

# AUC values
auc_values <- data.frame(
  Model = c("Full Model", "Stepwise AIC", "BIC Model", "Lasso Model"),
  AUC = c(auc(full_roc), auc(stepwise_roc), auc(bic_roc), auc(lasso_roc))
)

print(auc_values)
```

All models have very similar AUC. In turn, the more parsimonious model, BIC, is recommended. 

# CONCLUSION

In this analysis, a comprehensive approach was taken to explore and model the NHANES dataset. Beginning with Exploratory Data Analysis (EDA), patterns and missing values were identified, leading to the implementation of Multiple Imputation by Chained Equations (MICE) to handle missing data. Feature engineering and transformation were applied to enhance model performance, followed by feature selection to refine predictor variables. Both linear and logistic regression models were developed and evaluated, including full models, stepwise selection methods, and regularized regression techniques. Ultimately, the Bayesian Information Criterion (BIC) emerged as the best-performing model for both linear and logistic regression, demonstrating its effectiveness in balancing model complexity and predictive accuracy.
